W1106 17:44:37.177000 2264496 site-packages/torch/distributed/run.py:792]
W1106 17:44:37.177000 2264496 site-packages/torch/distributed/run.py:792] *****************************************
W1106 17:44:37.177000 2264496 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1106 17:44:37.177000 2264496 site-packages/torch/distributed/run.py:792] *****************************************
[2025-11-06 17:44:39,079] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-11-06 17:44:39,087] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-11-06 17:44:39,092] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-11-06 17:44:39,092] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /root/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
[2025-11-06 17:44:42,064] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-11-06 17:44:42,065] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-11-06 17:44:42,066] [INFO] [comm.py:652:init_distributed] cdb=None
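The two warnings above are actionable: DeepSpeed suggests moving the Triton autotune cache off NFS via `TRITON_CACHE_DIR`, and torchrun pins `OMP_NUM_THREADS=1` by default. A minimal sketch of applying both before launch, assuming a hypothetical local path `/tmp/triton_cache` and a placeholder thread count (both must be set in the launching environment, before Triton/DeepSpeed are imported):

```python
import os

# Point the Triton autotune cache at local, non-NFS storage, per the DeepSpeed
# warning above. "/tmp/triton_cache" is an example path, not from the log.
os.environ["TRITON_CACHE_DIR"] = "/tmp/triton_cache"

# torchrun defaults OMP_NUM_THREADS to 1 per process to avoid oversubscription;
# a larger value can help CPU-side work. "8" is a placeholder to tune per host.
os.environ["OMP_NUM_THREADS"] = "8"
```

In practice these are usually exported in the launch script (e.g. `export TRITON_CACHE_DIR=...` before invoking `torchrun`), since they must be visible to every spawned rank.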
[2025-11-06 17:44:42,245] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-11-06 17:44:42,246] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
11/06/2025 17:44:42 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
11/06/2025 17:44:42 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False
11/06/2025 17:44:42 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False
11/06/2025 17:44:42 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
11/06/2025 17:44:42 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=zero_stage1_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=2,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe/runs/Nov06_17-44-42_gpu-lg-cmc-h-h200-0964.host.h.pjlab.org.cn,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.COSINE,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1.0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
output_dir=/mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=/mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=5000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=2,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.05,
)
11/06/2025 17:44:42 - INFO - __main__ - Loading Tokenizer:
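From the TrainingArguments dump and the four rank warnings above, the effective global batch size per optimizer step follows directly; a quick sketch of the arithmetic:

```python
# Values taken from the TrainingArguments dump and rank warnings in the log.
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
world_size = 4  # ranks 0-3, one GPU each (cuda:0 .. cuda:3)

# Samples consumed per optimizer step across all ranks.
global_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * world_size
)
print(global_batch_size)  # 8
```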
/mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified
[INFO|tokenization_utils_base.py:2025] 2025-11-06 17:44:42,408 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2025] 2025-11-06 17:44:42,408 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2025] 2025-11-06 17:44:42,408 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2025-11-06 17:44:42,408 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2025-11-06 17:44:42,408 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2025-11-06 17:44:42,408 >> loading file tokenizer.json
[WARNING|logging.py:314] 2025-11-06 17:44:42,429 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-11-06 17:44:42,443 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-11-06 17:44:42,459 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[debug output: several large tensors printed here — mostly uninitialized buffers filled with 0.0 / 1.0, denormals, and occasional nan or overflow-scale values; omitted]
[WARNING|logging.py:314] 2025-11-06 17:44:42,544 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
11/06/2025 17:44:42 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2025-11-06 17:44:42,547 >> loading configuration file /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified/config.json
[INFO|configuration_utils.py:792] 2025-11-06 17:44:42,548 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "_name_or_path": "/mnt/shared-storage-user/jiaziheng/LMMS/internvl-pretrain-10_9_clip",
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "hidden_size": 3584,
  "image_fold": null,
  "llm_config": {
    "_name_or_path": "./pretrained/Qwen2.5-32B-Instruct",
    "add_cross_attention": false,
    "architectures": [
      "Qwen2ForCausalLM"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 151643,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 151643,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 3584,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 18944,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "max_window_layers": 70,
    "min_length": 0,
    "model_type": "qwen2",
    "moe_config": null,
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 28,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 28,
    "num_key_value_heads": 4,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-06,
    "rope_scaling": {
      "factor": 2.0,
      "rope_type": "dynamic",
      "type": "dynamic"
    },
    "rope_theta": 1000000.0,
    "sep_token_id": null,
    "sliding_window": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": false,
    "use_sliding_window": false,
    "vocab_size": 151674
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internvl2_5",
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "OpenGVLab/InternViT-6B-448px-V1-5",
    "add_cross_attention": false,
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "auto_map": {
      "AutoConfig": "configuration_intern_vit.InternVisionConfig",
      "AutoModel": "modeling_intern_vit.InternVisionModel"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "capacity_factor": 1.2,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "eval_capacity_factor": 1.4,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 0.1,
    "initializer_range": 1e-10,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "laux_allreduce": "all_nodes",
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "moe_coeff_ratio": 0.5,
    "moe_intermediate_size": 768,
    "moe_output_scale": 4.0,
    "no_repeat_ngram_size": 0,
    "noisy_gate_policy": "RSample_before",
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_experts": 8,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "num_routed_experts": 4,
    "num_shared_experts": 4,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "shared_expert_intermediate_size": 3072,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true,
    "use_moe": false,
    "use_residual": true,
    "use_rts": false,
    "use_weighted_residual": false
  }
}
11/06/2025 17:44:42 - INFO - __main__ - Using flash_attention_2 for LLaMA
[INFO|modeling_utils.py:3473] 2025-11-06 17:44:42,549 >> loading weights file /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified/model.safetensors
[INFO|modeling_utils.py:1426] 2025-11-06 17:44:42,565 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2025-11-06 17:44:42,565 >> Generate config GenerationConfig {}
[debug tensor dump omitted: one all-zero tensor and one tensor of uninitialized memory (denormals, extreme magnitudes, and nan entries) printed during model instantiation]
[INFO|modeling_utils.py:4350] 2025-11-06 17:44:43,426 >> All model checkpoint weights were used when initializing InternVLChatModel.
[INFO|modeling_utils.py:4358] 2025-11-06 17:44:43,426 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2025-11-06 17:44:43,429 >> loading configuration file /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified/generation_config.json
[INFO|configuration_utils.py:826] 2025-11-06 17:44:43,429 >> Generate config GenerationConfig {}
11/06/2025 17:44:43 - INFO - __main__ - Finished
11/06/2025 17:44:43 - INFO - __main__ - model.config.force_image_size: 448
11/06/2025 17:44:43 - INFO - __main__ - data_args.force_image_size: 448
11/06/2025 17:44:43 - INFO - __main__ - model.config.vision_config.image_size: 448
11/06/2025 17:44:43 - INFO - __main__ - [Dataset] num_image_token: 256
11/06/2025 17:44:43 - INFO - __main__ - [Dataset] dynamic_image_size: True
11/06/2025 17:44:43 - INFO - __main__ - [Dataset] use_thumbnail: True
11/06/2025 17:44:43 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
11/06/2025 17:44:43 - INFO - __main__ - Formatting inputs...Skip in lazy mode
Total parameters: ~339.26 MB)
Trainable parameters: ~1.61 MB)
data_args.use_packed_ds False
11/06/2025 17:44:43 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 28056
11/06/2025 17:44:43 - INFO - __main__ - quality.0.weight
11/06/2025 17:44:43 - INFO - __main__ - quality.0.bias
11/06/2025 17:44:43 - INFO - __main__ - quality.1.weight
11/06/2025 17:44:43 - INFO - __main__ - quality.1.bias
[INFO|trainer.py:571] 2025-11-06 17:44:43,677 >> Using auto half precision backend
[2025-11-06 17:44:43,822] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown
[2025-11-06 17:44:43,822] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
[2025-11-06 17:44:45,382] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.14368486404418945 seconds
[2025-11-06 17:44:45,527] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-11-06 17:44:45,527] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-11-06 17:44:45,528] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2025-11-06 17:44:45,528] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=
[2025-11-06 17:44:45,528] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
[2025-11-06 17:44:45,528] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000
[2025-11-06 17:44:45,528] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000
[2025-11-06 17:44:45,528] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2025-11-06 17:44:45,528] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.20165395736694336 seconds
Time to load fused_adam op: 0.20314788818359375 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.20111775398254395 seconds
[2025-11-06 17:44:45,940] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-11-06 17:44:45,940] [INFO] [utils.py:782:see_memory_usage] MA 0.64 GB Max_MA 0.64 GB CA 0.66 GB Max_CA 1 GB
[2025-11-06 17:44:45,942] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 703.05 GB, percent = 51.4%
[2025-11-06 17:44:46,041] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-11-06 17:44:46,041] [INFO] [utils.py:782:see_memory_usage] MA 0.64 GB Max_MA 0.64 GB CA 0.66 GB Max_CA 1 GB
[2025-11-06 17:44:46,043] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 703.05 GB, percent = 51.4%
[2025-11-06 17:44:46,043] [INFO] [stage_1_and_2.py:544:__init__] optimizer state initialized
[2025-11-06 17:44:46,138] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-11-06 17:44:46,138] [INFO] [utils.py:782:see_memory_usage] MA 0.64 GB Max_MA 0.64 GB CA 0.66 GB Max_CA 1 GB
[2025-11-06 17:44:46,139] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 703.05 GB, percent = 51.4%
[2025-11-06 17:44:46,140] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2025-11-06 17:44:46,140] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2025-11-06 17:44:46,140] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2025-11-06 17:44:46,140] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2025-11-06 17:44:46,141] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] amp_enabled .................. False
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] amp_params ................... False
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] bfloat16_enabled ............. True
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] comms_config .................
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] communication_data_type ...... None
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] dataloader_drop_last ......... False
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] disable_allgather ............ False
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] dump_state ................... False
[2025-11-06 17:44:46,141] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] elasticity_enabled ........... False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] fp16_auto_cast ............... None
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] fp16_enabled ................. False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] global_rank .................. 0
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] grad_accum_dtype ............. None
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] graph_harvesting ............. False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] load_universal_checkpoint .... False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] loss_scale ................... 1.0
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] memory_breakdown ............. False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] mics_shard_size .............. -1
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] optimizer_name ............... adamw
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 2e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.05}
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] pld_enabled .................. False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] pld_params ................... False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] prescale_gradients ........... False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] scheduler_name ............... None
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] scheduler_params ............. None
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] sparse_attention ............. None
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] steps_per_print .............. inf
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] train_batch_size ............. 8
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] use_node_local_storage ....... False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] weight_quantization_config ... None
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] world_size ................... 4
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] zero_enabled ................. True
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True
[2025-11-06 17:44:46,142] [INFO] [config.py:1003:print] zero_optimization_stage ...... 1
[2025-11-06 17:44:46,142] [INFO] [config.py:989:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 2e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.05 } }, "gradient_accumulation_steps": 2, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 8, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": true }
[INFO|trainer.py:1721] 2025-11-06 17:44:46,142 >> ***** Running training *****
[INFO|trainer.py:1722] 2025-11-06 17:44:46,142 >> Num examples = 28,056
[INFO|trainer.py:1723] 2025-11-06 17:44:46,142 >> Num Epochs = 1
[INFO|trainer.py:1724] 2025-11-06 17:44:46,142 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1727] 2025-11-06 17:44:46,142 >> Total train batch size (w.
parallel, distributed & accumulation) = 8
[INFO|trainer.py:1728] 2025-11-06 17:44:46,142 >> Gradient Accumulation steps = 2
[INFO|trainer.py:1729] 2025-11-06 17:44:46,142 >> Total optimization steps = 3,507
[INFO|trainer.py:1730] 2025-11-06 17:44:46,143 >> Number of trainable parameters = 1,606,405
0%| | 0/3507 [00:00) tensor([2], device='cuda:1')
[WARNING|modeling_utils.py:1124] 2025-11-06 17:45:11,071 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
tensor([[-0.6406, -0.6641, -0.6250, -0.6250, -0.6328]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.8359, -0.8711, -0.8125, -0.8164, -0.8242]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[WARNING|modeling_utils.py:1124] 2025-11-06 17:45:14,717 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
tensor([[0.1797, 0.1895, 0.1816, 0.1787, 0.1865]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[WARNING|modeling_utils.py:1124] 2025-11-06 17:45:14,801 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
tensor([[-0.7578, -0.7891, -0.7422, -0.7422, -0.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.0000, -1.0469, -0.9766, -0.9766, -0.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[2.5000, 2.6250, 2.4531, 2.4375, 2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:45:21,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 559.46 | bwd_microstep: 16.47 | bwd_inner_microstep: 16.33 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.04
[WARNING|modeling_utils.py:1124] 2025-11-06 17:45:21,478 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
tensor([[-0.8828, -0.9219, -0.8555, -0.8633, -0.8711]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:45:21,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.70 | optimizer_gradients: 0.31 | optimizer_step: 0.52
[2025-11-06 17:45:21,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.43 | bwd_microstep: 2.05 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 1.07 | step_microstep: 77.87
[2025-11-06 17:45:21,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 708.88 | bwd: 18.52 | bwd_inner: 17.27 | bwd_allreduce: 1.12 | step: 77.91
0%| | 1/3507 [00:35<34:36:47, 35.54s/it] {'loss': 1.6055, 'learning_rate': 1.886792452830189e-07, 'epoch': 0.0}
tensor([[-0.1230, -0.1260, -0.1172, -0.1206, -0.1162]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.5469, -1.6094, -1.5078, -1.5078, -1.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.2656, -0.2812, -0.2578, -0.2578, -0.2598]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:45:21,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.33 | bwd_microstep: 0.57 | bwd_inner_microstep: 0.48 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.02
tensor([[0.2578, 0.2754, 0.2617, 0.2490, 0.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[0.9492, 0.9922, 0.9375, 0.9297, 0.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.0952, -0.0928, -0.0830, -0.0938, -0.0854]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[0.1172, 0.1240, 0.1123, 0.1133, 0.1177]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-0.1289, -0.1279, -0.1187, -0.1245, -0.1206]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:45:37,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 17:45:37,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 72.74 | bwd_microstep: 3913.40 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 3912.37 | step_microstep: 1.99
[2025-11-06 17:45:37,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 251.08 | bwd: 3913.96 | bwd_inner: 1.45 | bwd_allreduce: 3912.40 | step: 2.02
0%| | 2/3507 [00:51<23:27:24, 24.09s/it] {'loss': 1.6064, 'learning_rate': 3.773584905660378e-07, 'epoch': 0.0}
tensor([[0.1387, 0.1504, 0.1426, 0.1338, 0.1436]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.3164, -0.3281, -0.3105, -0.3066, -0.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.0000, -1.0547, -0.9766, -0.9805, -0.9922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[0.1797, 0.1895, 0.1807, 0.1768, 0.1885]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.4746, -0.4961, -0.4590, -0.4688, -0.4668]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.5703, -0.5898, -0.5469, -0.5508, -0.5586]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[0.4824, 0.5039, 0.4746, 0.4727, 0.4824]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:45:38,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.51 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.8398, -0.8789, -0.8203, -0.8203, -0.8320]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:45:38,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 17:45:38,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 394.45 | bwd_microstep: 1.60 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.79 | step_microstep: 1.44
[2025-11-06 17:45:38,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 571.99 | bwd: 2.48 | bwd_inner: 1.51 | bwd_allreduce: 0.83 | step: 1.52
0%| | 3/3507 [00:52<13:05:40, 13.45s/it] {'loss': 1.6006, 'learning_rate': 5.660377358490567e-07, 'epoch': 0.0}
tensor([[-1.0547, -1.1016, -1.0312, -1.0312, -1.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.9062, -0.9453, -0.8867, -0.8867, -0.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[1.6328, 1.7109, 1.6094, 1.6016, 1.6328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:45:38,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.80 | bwd_microstep: 0.51 | bwd_inner_microstep: 0.43 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.05
tensor([[-0.4004, -0.4238, -0.3906, -0.3926, -0.3965]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.8555, -0.8906, -0.8359, -0.8359, -0.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[1.2969, 1.3516, 1.2812, 1.2656, 1.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[0.5234, 0.5430, 0.5156, 0.5117, 0.5273]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[0.2090, 0.2227, 0.2100, 0.2080, 0.2100]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 17:45:44,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.24
[2025-11-06 17:45:44,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.18 | bwd_microstep: 602.65 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 601.52 | step_microstep: 1.92
[2025-11-06 17:45:44,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.97 | bwd: 603.17 | bwd_inner: 1.46 | bwd_allreduce: 601.56 | step: 1.98
0%| | 4/3507 [00:58<10:19:38, 10.61s/it] {'loss': 1.6094, 'learning_rate': 7.547169811320755e-07, 'epoch': 0.0}
tensor([[1.3359, 1.3984, 1.3203, 1.3125, 1.3359]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:45:45,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.10 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[1.5234, 1.5938, 1.5000, 1.4922, 1.5234]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[0.1113, 0.1177, 0.1143, 0.1089, 0.1162]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[0.1221, 0.1289, 0.1270, 0.1250, 0.1309]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:0')
tensor([[-0.2891, -0.3066, -0.2832, -0.2832, -0.2852]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.1484, -1.2031, -1.1250, -1.1250, -1.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[0.0273, 0.0298, 0.0286, 0.0231, 0.0344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.6875, -0.7148, -0.6680, -0.6719, -0.6797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:45:45,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) |
optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 17:45:45,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.48 | bwd_microstep: 439.19 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 438.34 | step_microstep: 1.60 [2025-11-06 17:45:45,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.60 | bwd: 440.05 | bwd_inner: 1.53 | bwd_allreduce: 438.38 | step: 1.68 0%| | 5/3507 [00:59<6:53:03, 7.08s/it] {'loss': 1.6123, 'learning_rate': 9.433962264150944e-07, 'epoch': 0.0} 0%| | 5/3507 [00:59<6:53:03, 7.08s/it]tensor([[-0.2344, -0.2393, -0.2227, -0.2285, -0.2256]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.2393, 0.2520, 0.2393, 0.2383, 0.2432]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.7500, -0.7891, -0.7383, -0.7305, -0.7461]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.3594, -0.3711, -0.3457, -0.3535, -0.3496]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:45:46,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 338.42 | bwd_microstep: 201.07 | bwd_inner_microstep: 200.99 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.05 tensor([[0.0356, 0.0396, 0.0366, 0.0371, 0.0388]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.2314, -0.2363, -0.2207, -0.2275, -0.2227]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[1.3281, 1.3906, 1.3047, 1.3047, 1.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.4043, -0.4160, -0.3926, -0.3965, -0.3945]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:45:57,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.19 
| optimizer_step: 0.20 [2025-11-06 17:45:57,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.94 | bwd_microstep: 3710.79 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 3709.69 | step_microstep: 228.32 [2025-11-06 17:45:57,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 506.38 | bwd: 3911.86 | bwd_inner: 202.02 | bwd_allreduce: 3709.73 | step: 228.38 0%| | 6/3507 [01:11<8:22:52, 8.62s/it] {'loss': 1.6064, 'learning_rate': 1.1320754716981133e-06, 'epoch': 0.0} 0%| | 6/3507 [01:11<8:22:52, 8.62s/it]tensor([[-0.3086, -0.3184, -0.3008, -0.2988, -0.3008]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.2520, 0.2598, 0.2471, 0.2480, 0.2559]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.1914, 0.2012, 0.1924, 0.1904, 0.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:45:57,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.52 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.3633, -0.3750, -0.3555, -0.3574, -0.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.7148, 0.7500, 0.7031, 0.6992, 0.7227]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.1025, 0.1157, 0.1045, 0.1006, 0.1084]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7812, -1.8594, -1.7422, -1.7344, -1.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.2080, -0.2178, -0.2031, -0.2070, -0.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:45:57,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.13 | optimizer_step: 0.14 [2025-11-06 17:45:57,817] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.46 | bwd_microstep: 1.65 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.79 | step_microstep: 1.80 [2025-11-06 17:45:57,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 507.01 | bwd: 2.44 | bwd_inner: 1.51 | bwd_allreduce: 0.82 | step: 1.87 0%| | 7/3507 [01:11<5:48:51, 5.98s/it] {'loss': 1.6055, 'learning_rate': 1.3207547169811322e-06, 'epoch': 0.0} 0%| | 7/3507 [01:11<5:48:51, 5.98s/it]tensor([[0.3242, 0.3418, 0.3223, 0.3145, 0.3320]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.1118, -0.1089, -0.1094, -0.1108, -0.1060]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.0518, 0.0544, 0.0540, 0.0496, 0.0549]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.2275, -0.2354, -0.2178, -0.2236, -0.2227]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:45:58,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 303.52 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.05 tensor([[-0.5273, -0.5469, -0.5078, -0.5117, -0.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.1235, -0.1270, -0.1162, -0.1260, -0.1201]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.1719, -1.2188, -1.1406, -1.1484, -1.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.5625, -0.5859, -0.5469, -0.5508, -0.5547]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:46:07,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.22 | optimizer_step: 0.33 [2025-11-06 17:46:07,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 130.38 | bwd_microstep: 9438.95 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 9438.12 | step_microstep: 2.44 [2025-11-06 17:46:07,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 433.92 | bwd: 9439.63 | bwd_inner: 1.35 | bwd_allreduce: 9438.16 | step: 2.49 0%| | 8/3507 [01:21<7:01:43, 7.23s/it] {'loss': 1.6025, 'learning_rate': 1.509433962264151e-06, 'epoch': 0.0} 0%| | 8/3507 [01:21<7:01:43, 7.23s/it]tensor([[-0.3574, -0.3691, -0.3477, -0.3457, -0.3496]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-0.4766, -0.4941, -0.4648, -0.4629, -0.4707]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:46:07,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.83 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.5781, -1.6406, -1.5391, -1.5391, -1.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.0737, -0.0791, -0.0723, -0.0747, -0.0684]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.3555, -0.3711, -0.3477, -0.3496, -0.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.2988, -0.3086, -0.2871, -0.2910, -0.2891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.2656, -1.3203, -1.2344, -1.2344, -1.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.0359, 0.0415, 0.0405, 0.0354, 0.0417]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:46:08,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 2.32 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 17:46:08,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.18 | bwd_microstep: 
160.03 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 158.96 | step_microstep: 4.10 [2025-11-06 17:46:08,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.02 | bwd: 160.86 | bwd_inner: 1.75 | bwd_allreduce: 158.99 | step: 4.17 0%| | 9/3507 [01:22<4:58:48, 5.13s/it] {'loss': 1.5986, 'learning_rate': 1.6981132075471698e-06, 'epoch': 0.0} 0%| | 9/3507 [01:22<4:58:48, 5.13s/it]tensor([[0.2520, 0.2695, 0.2520, 0.2451, 0.2578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.0449, 0.0449, 0.0479, 0.0459, 0.0493]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.4160, -0.4336, -0.4062, -0.4062, -0.4121]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.1758, -0.1816, -0.1699, -0.1689, -0.1699]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.5234, -0.5469, -0.5078, -0.5117, -0.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.2500, 0.2676, 0.2520, 0.2471, 0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:46:08,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.17 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.18 tensor([[-0.6328, -0.6602, -0.6250, -0.6211, -0.6289]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.2930, -0.3008, -0.2871, -0.2852, -0.2891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:46:12,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 17:46:12,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.40 | bwd_microstep: 3007.09 | bwd_inner_microstep: 1.06 | 
bwd_allreduce_microstep: 3005.94 | step_microstep: 333.86 [2025-11-06 17:46:12,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 898.61 | bwd: 3008.06 | bwd_inner: 1.91 | bwd_allreduce: 3005.98 | step: 334.04 0%| | 10/3507 [01:26<4:43:51, 4.87s/it] {'loss': 1.6035, 'learning_rate': 1.8867924528301889e-06, 'epoch': 0.0} 0%| | 10/3507 [01:26<4:43:51, 4.87s/it]tensor([[-0.1953, -0.2070, -0.1875, -0.1895, -0.1895]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.1084, -0.1104, -0.1006, -0.1001, -0.1025]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:46:12,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.62 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.3730, -0.3906, -0.3633, -0.3633, -0.3672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.4746, 0.4941, 0.4727, 0.4648, 0.4766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.2656, -1.3203, -1.2344, -1.2344, -1.2578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.4238, -0.4434, -0.4102, -0.4062, -0.4160]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[0.1963, 0.2100, 0.1973, 0.1924, 0.2041]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.7109, -0.7422, -0.6914, -0.6953, -0.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:46:12,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.13 | optimizer_step: 0.13 [2025-11-06 17:46:12,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 200.98 | bwd_microstep: 10.46 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 9.52 | 
step_microstep: 1.31 [2025-11-06 17:46:12,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.62 | bwd: 11.28 | bwd_inner: 1.62 | bwd_allreduce: 9.55 | step: 1.38 0%| | 11/3507 [01:26<3:24:27, 3.51s/it] {'loss': 1.6006, 'learning_rate': 2.075471698113208e-06, 'epoch': 0.0} 0%| | 11/3507 [01:26<3:24:27, 3.51s/it]tensor([[-0.8477, -0.8828, -0.8281, -0.8242, -0.8398]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.5117, 0.5391, 0.5078, 0.5078, 0.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[1.0000, 1.0469, 0.9883, 0.9805, 1.0078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:46:13,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.48 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.05 tensor([[-0.3867, -0.3965, -0.3750, -0.3750, -0.3789]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.6680, -0.6953, -0.6484, -0.6523, -0.6602]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.1167, 0.1270, 0.1157, 0.1162, 0.1206]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[0.5547, 0.5781, 0.5508, 0.5469, 0.5547]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.6016, 0.6328, 0.5977, 0.5898, 0.6016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:46:18,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.25 | optimizer_step: 0.29 [2025-11-06 17:46:18,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.28 | bwd_microstep: 5335.80 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 5334.81 | step_microstep: 221.05 [2025-11-06 17:46:18,862] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.78 | bwd: 5336.54 | bwd_inner: 1.56 | bwd_allreduce: 5334.86 | step: 221.11 0%| | 12/3507 [01:32<4:07:06, 4.24s/it] {'loss': 1.6084, 'learning_rate': 2.2641509433962266e-06, 'epoch': 0.0} 0%| | 12/3507 [01:32<4:07:06, 4.24s/it]tensor([[0.0864, 0.0918, 0.0894, 0.0845, 0.0889]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.1523, 0.1631, 0.1562, 0.1494, 0.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:46:19,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.55 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.06 tensor([[0.0640, 0.0693, 0.0679, 0.0635, 0.0718]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.1416, 0.1553, 0.1465, 0.1396, 0.1475]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.2285, -0.2373, -0.2207, -0.2227, -0.2207]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-0.5703, -0.5977, -0.5547, -0.5625, -0.5664]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.1855, -0.1934, -0.1758, -0.1777, -0.1807]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.2266, -1.2734, -1.1875, -1.1953, -1.2109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:46:19,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:46:19,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.14 | bwd_microstep: 204.16 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 202.97 | step_microstep: 1.51 [2025-11-06 17:46:19,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
334.72 | bwd: 205.04 | bwd_inner: 1.92 | bwd_allreduce: 203.00 | step: 1.57 0%| | 13/3507 [01:33<3:02:17, 3.13s/it] {'loss': 1.6045, 'learning_rate': 2.4528301886792453e-06, 'epoch': 0.0} 0%| | 13/3507 [01:33<3:02:17, 3.13s/it]tensor([[-1.7578, -1.8359, -1.7109, -1.7109, -1.7422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[0.4219, 0.4395, 0.4219, 0.4121, 0.4238]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.1826, -0.1846, -0.1738, -0.1807, -0.1748]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:46:19,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.55 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-0.0718, -0.0698, -0.0684, -0.0669, -0.0640]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.0000, -2.0781, -1.9531, -1.9453, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.4785, -0.4961, -0.4668, -0.4629, -0.4707]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.3770, -0.3887, -0.3574, -0.3613, -0.3652]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.5703, -0.5938, -0.5547, -0.5547, -0.5664]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:46:20,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 17:46:20,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.71 | bwd_microstep: 624.35 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 623.26 | step_microstep: 1.69 [2025-11-06 17:46:20,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.29 | bwd: 625.14 | bwd_inner: 1.70 | 
bwd_allreduce: 623.30 | step: 1.77 0%| | 14/3507 [01:34<2:25:04, 2.49s/it] {'loss': 1.5918, 'learning_rate': 2.6415094339622644e-06, 'epoch': 0.0} 0%| | 14/3507 [01:34<2:25:04, 2.49s/it]tensor([[0.2344, 0.2461, 0.2324, 0.2402, 0.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.7266, 0.7578, 0.7188, 0.7148, 0.7266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.1245, -0.1299, -0.1138, -0.1196, -0.1162]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:46:20,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.06 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[0.3770, 0.4004, 0.3789, 0.3750, 0.3848]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.4570, -0.4707, -0.4375, -0.4414, -0.4473]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.7734, -0.8086, -0.7500, -0.7539, -0.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[0.0854, 0.0903, 0.0923, 0.0864, 0.0894]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.5273, -0.5469, -0.5078, -0.5117, -0.5195]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:46:21,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 17:46:21,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.06 | bwd_microstep: 333.05 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 331.90 | step_microstep: 1.72 [2025-11-06 17:46:21,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 440.15 | bwd: 334.02 | bwd_inner: 1.95 | bwd_allreduce: 331.95 | step: 1.81 0%| | 15/3507 
[01:35<1:55:35, 1.99s/it] {'loss': 1.6055, 'learning_rate': 2.830188679245283e-06, 'epoch': 0.0} 0%| | 15/3507 [01:35<1:55:35, 1.99s/it]tensor([[-0.2773, -0.2812, -0.2617, -0.2617, -0.2695]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.5664, -0.5938, -0.5508, -0.5469, -0.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.7109, -1.7891, -1.6641, -1.6641, -1.6953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:46:21,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.08 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.3164, -0.3281, -0.3027, -0.3027, -0.3105]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.5977, 0.6250, 0.5977, 0.5859, 0.6055]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-0.7383, -0.7695, -0.7227, -0.7188, -0.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.1807, 0.1875, 0.1826, 0.1836, 0.1836]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.9297, -0.9648, -0.9062, -0.8984, -0.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:46:35,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.23 | optimizer_step: 0.33 [2025-11-06 17:46:35,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 84.91 | bwd_microstep: 13775.17 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 13774.16 | step_microstep: 2.73 [2025-11-06 17:46:35,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 270.01 | bwd: 13775.96 | bwd_inner: 1.63 | bwd_allreduce: 13774.21 | step: 2.80 0%| | 16/3507 [01:49<5:27:19, 5.63s/it] {'loss': 1.5928, 
'learning_rate': 3.018867924528302e-06, 'epoch': 0.0} 0%| | 16/3507 [01:49<5:27:19, 5.63s/it]tensor([[0.2334, 0.2432, 0.2393, 0.2383, 0.2363]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:46:35,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.77 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.05 tensor([[0.1406, 0.1484, 0.1455, 0.1445, 0.1455]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.5820, 0.6094, 0.5820, 0.5742, 0.5859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.0859, -1.1328, -1.0547, -1.0469, -1.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.4961, -0.5117, -0.4805, -0.4727, -0.4902]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.6250, -1.6875, -1.5859, -1.5781, -1.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3750, -3.5156, -3.2812, -3.2656, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.0312, -1.0703, -0.9961, -0.9922, -1.0234]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:46:35,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.11 | optimizer_step: 0.10 [2025-11-06 17:46:35,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 121.56 | bwd_microstep: 136.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 136.04 | step_microstep: 1.14 [2025-11-06 17:46:35,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 248.34 | bwd: 137.91 | bwd_inner: 1.73 | bwd_allreduce: 136.07 | step: 1.19 0%| | 17/3507 [01:49<3:56:01, 4.06s/it] {'loss': 1.585, 'learning_rate': 3.207547169811321e-06, 'epoch': 0.0} 0%| 
| 17/3507 [01:49<3:56:01, 4.06s/it]tensor([[0.9023, 0.9492, 0.8945, 0.8867, 0.9023]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.4805, -0.5000, -0.4648, -0.4590, -0.4746]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.5469, -0.5664, -0.5195, -0.5195, -0.5391]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:46:35,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.58 | bwd_microstep: 0.60 | bwd_inner_microstep: 0.52 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.05 tensor([[-0.6680, -0.6953, -0.6406, -0.6367, -0.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.2734, -0.2832, -0.2598, -0.2559, -0.2676]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.4629, -0.4785, -0.4375, -0.4434, -0.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.9141, -0.9492, -0.8789, -0.8789, -0.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[1.4531, 1.5234, 1.4297, 1.4141, 1.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:47:25,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.71 | optimizer_step: 0.64 [2025-11-06 17:47:25,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 122.95 | bwd_microstep: 49731.52 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 49730.67 | step_microstep: 5.16 [2025-11-06 17:47:25,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.55 | bwd: 49732.14 | bwd_inner: 1.25 | bwd_allreduce: 49730.75 | step: 5.24 1%| | 18/3507 [02:39<17:19:58, 17.88s/it] {'loss': 1.6016, 'learning_rate': 3.3962264150943395e-06, 'epoch': 0.01} 1%| | 18/3507 [02:39<17:19:58, 
17.88s/it]tensor([[-0.6367, -0.6602, -0.6055, -0.6094, -0.6289]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.0043, 0.0037, 0.0117, 0.0184, 0.0079]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:47:26,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.86 | bwd_microstep: 1.47 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.15 tensor([[-0.1533, -0.1523, -0.1426, -0.1396, -0.1465]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.9492, -0.9844, -0.9258, -0.9102, -0.9414]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.0337, 0.0386, 0.0447, 0.0393, 0.0386]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.1289, 0.1387, 0.1406, 0.1318, 0.1367]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.2930, -0.3008, -0.2715, -0.2754, -0.2852]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[0.4570, 0.4766, 0.4609, 0.4512, 0.4590]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:47:26,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.21 | optimizer_step: 0.18 [2025-11-06 17:47:26,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.31 | bwd_microstep: 156.56 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 155.40 | step_microstep: 1.96 [2025-11-06 17:47:26,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 319.19 | bwd: 158.02 | bwd_inner: 2.34 | bwd_allreduce: 155.46 | step: 2.12 1%| | 19/3507 [02:40<12:16:29, 12.67s/it] {'loss': 1.5977, 'learning_rate': 3.5849056603773586e-06, 'epoch': 0.01} 1%| | 19/3507 [02:40<12:16:29, 12.67s/it]tensor([[-0.0143, -0.0120, 0.0012, -0.0082, 
[interleaved per-rank debug prints of bfloat16 tensors and label tensors (grad_fn reprs stripped during extraction) omitted]
[2025-11-06 17:47:26,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.03 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 17:47:27,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 17:47:27,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.67 | bwd_microstep: 332.51 | bwd_inner_microstep: 1.55 | bwd_allreduce_microstep: 330.89 | step_microstep: 1.89
[2025-11-06 17:47:27,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.72 | bwd: 333.55 | bwd_inner: 2.51 | bwd_allreduce: 330.92 | step: 1.96
1%| | 20/3507 [02:40<8:48:38, 9.10s/it] {'loss': 1.585, 'learning_rate': 3.7735849056603777e-06, 'epoch': 0.01}
[2025-11-06 17:47:27,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.83 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[2025-11-06 17:47:29,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.26
[2025-11-06 17:47:29,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.58 | bwd_microstep: 1515.79 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1514.72 | step_microstep: 1.91
[2025-11-06 17:47:29,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.41 | bwd: 1516.74 | bwd_inner: 1.83 | bwd_allreduce: 1514.77 | step: 2.00
1%| | 21/3507 [02:43<6:51:49, 7.09s/it] {'loss': 1.6045, 'learning_rate': 3.962264150943396e-06, 'epoch': 0.01}
[2025-11-06 17:47:29,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.65 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:47:30,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:47:30,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 273.38 | bwd_microstep: 1.79 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.80 | step_microstep: 1.51
[2025-11-06 17:47:30,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 419.06 | bwd: 2.70 | bwd_inner: 1.74 | bwd_allreduce: 0.84 | step: 1.60
1%| | 22/3507 [02:43<4:58:11, 5.13s/it] {'loss': 1.5811, 'learning_rate': 4.150943396226416e-06, 'epoch': 0.01}
[2025-11-06 17:47:31,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.85 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:47:31,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 17:47:31,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.28 | bwd_microstep: 1.98 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.90 | step_microstep: 2.24
[2025-11-06 17:47:31,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 407.17 | bwd: 2.94 | bwd_inner: 1.84 | bwd_allreduce: 0.94 | step: 2.33
1%| | 23/3507 [02:45<3:54:28, 4.04s/it] {'loss': 1.6123, 'learning_rate': 4.339622641509435e-06, 'epoch': 0.01}
[2025-11-06 17:47:31,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.77 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
[2025-11-06 17:47:33,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.14 | optimizer_step: 0.18
[2025-11-06 17:47:33,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 122.49 | bwd_microstep: 2.12 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.93 | step_microstep: 1.98
[2025-11-06 17:47:33,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.27 | bwd: 2.90 | bwd_inner: 1.82 | bwd_allreduce: 0.95 | step: 2.05
1%| | 24/3507 [02:47<3:13:34, 3.33s/it] {'loss': 1.5908, 'learning_rate': 4.528301886792453e-06, 'epoch': 0.01}
[2025-11-06 17:47:34,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.20 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
[2025-11-06 17:47:34,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.22 | optimizer_step: 0.17
[2025-11-06 17:47:34,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 282.06 | bwd_microstep: 73.21 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 72.01 | step_microstep: 2.66
[2025-11-06 17:47:34,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 479.28 | bwd: 74.15 | bwd_inner: 1.96 | bwd_allreduce: 72.05 | step: 2.75
1%| | 25/3507 [02:48<2:37:36, 2.72s/it] {'loss': 1.5947, 'learning_rate': 4.716981132075472e-06, 'epoch': 0.01}
[2025-11-06 17:47:34,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.82 | bwd_microstep: 1.12 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
[2025-11-06 17:47:36,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 17:47:36,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.29 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.40
[2025-11-06 17:47:36,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.13 | bwd: 3.17 | bwd_inner: 2.14 | bwd_allreduce: 0.89 | step: 2.49
1%| | 26/3507 [02:50<2:31:47, 2.62s/it] {'loss': 1.584, 'learning_rate': 4.905660377358491e-06, 'epoch': 0.01}
[2025-11-06 17:47:37,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.87 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 17:47:37,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.20 | optimizer_step: 0.19
[2025-11-06 17:47:37,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.51 | bwd_microstep: 55.36 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 54.26 | step_microstep: 2.35
[2025-11-06 17:47:37,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.41 | bwd: 56.19 | bwd_inner: 1.75 | bwd_allreduce: 54.30 | step: 2.43
1%| | 27/3507 [02:51<1:53:55, 1.96s/it] {'loss': 1.6084, 'learning_rate': 5.09433962264151e-06, 'epoch': 0.01}
[2025-11-06 17:47:37,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.48 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 17:47:40,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 17:47:40,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.08 | bwd_microstep: 2.16 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.98 | step_microstep: 39.15
[2025-11-06 17:47:40,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 513.55 | bwd: 2.95 | bwd_inner: 1.82 | bwd_allreduce: 1.01 | step: 39.22
1%| | 28/3507 [02:54<2:11:17, 2.26s/it] {'loss': 1.5898, 'learning_rate': 5.283018867924529e-06, 'epoch': 0.01}
[2025-11-06 17:47:40,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.70 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
[2025-11-06 17:47:40,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.22 | optimizer_step: 0.21
[2025-11-06 17:47:40,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.15 | bwd_microstep: 47.58 | bwd_inner_microstep: 1.44 | bwd_allreduce_microstep: 46.01 | step_microstep: 2.76
[2025-11-06 17:47:40,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.87 | bwd: 48.48 | bwd_inner: 2.24 | bwd_allreduce: 46.05 | step: 2.86
1%| | 29/3507 [02:54<1:39:34, 1.72s/it] {'loss': 1.5928, 'learning_rate': 5.4716981132075475e-06, 'epoch': 0.01}
[2025-11-06 17:47:40,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.60 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:47:42,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 17:47:42,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.22 | bwd_microstep: 167.13 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 166.26 | step_microstep: 1.84
[2025-11-06 17:47:42,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.86 | bwd: 167.80 | bwd_inner: 1.35 | bwd_allreduce: 166.30 | step: 1.92
1%| | 30/3507 [02:55<1:32:55, 1.60s/it] {'loss': 1.584, 'learning_rate': 5.660377358490566e-06, 'epoch': 0.01}
[2025-11-06 17:47:42,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.74 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:47:43,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 17:47:43,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.95 | bwd_microstep: 576.72 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 575.72 | step_microstep: 1.95
[2025-11-06 17:47:43,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.71 | bwd: 577.39 | bwd_inner: 1.49 | bwd_allreduce: 575.76 | step: 2.03
1%| | 31/3507 [02:56<1:21:35, 1.41s/it] {'loss': 1.5742, 'learning_rate': 5.849056603773585e-06, 'epoch': 0.01}
[2025-11-06 17:47:43,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.12 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.13
[2025-11-06 17:47:46,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 17:47:46,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.45 | bwd_microstep: 1103.81 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1102.73 | step_microstep: 2.25
[2025-11-06 17:47:46,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.59 | bwd: 1104.82 | bwd_inner: 1.89 | bwd_allreduce: 1102.78 | step: 2.38
1%| | 32/3507 [03:00<2:01:33, 2.10s/it] {'loss': 1.5957, 'learning_rate': 6.037735849056604e-06, 'epoch': 0.01}
[2025-11-06 17:47:46,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.13 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:47:47,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:47:47,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.65 | bwd_microstep: 34.53 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 33.60 | step_microstep: 1.98
[2025-11-06 17:47:47,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.80 | bwd: 35.25 | bwd_inner: 1.48 | bwd_allreduce: 33.63 | step: 2.06
1%| | 33/3507 [03:01<1:32:10, 1.59s/it] {'loss': 1.6055, 'learning_rate': 6.226415094339623e-06, 'epoch': 0.01}
[2025-11-06 17:47:47,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.85 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 17:47:48,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.68 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 17:47:48,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.79 | bwd_microstep: 1.92 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.76 | step_microstep: 2.17
[2025-11-06 17:47:48,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.65 | bwd: 2.83 | bwd_inner: 1.92 | bwd_allreduce: 0.79 | step: 2.24
1%| | 34/3507 [03:02<1:27:04, 1.50s/it] {'loss': 1.582, 'learning_rate': 6.415094339622642e-06, 'epoch': 0.01}
[2025-11-06 17:47:48,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.71 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:47:49,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.66 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 17:47:49,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.63 | bwd_microstep: 622.99 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 621.72 | step_microstep: 2.33
[2025-11-06 17:47:49,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 440.37 | bwd: 623.95 | bwd_inner: 2.05 | bwd_allreduce: 621.76 | step: 2.41
1%| | 35/3507 [03:03<1:20:08, 1.38s/it] {'loss': 1.5918, 'learning_rate': 6.60377358490566e-06, 'epoch': 0.01}
[2025-11-06 17:47:49,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 114.33 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
[2025-11-06 17:47:50,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.14 | optimizer_step: 0.18
[2025-11-06 17:47:50,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.31 | bwd_microstep: 2.01 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 0.77 | step_microstep: 1.79
[2025-11-06 17:47:50,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 247.63 | bwd: 2.81 | bwd_inner: 1.90 | bwd_allreduce: 0.80 | step: 1.87
1%| | 36/3507 [03:04<1:11:39, 1.24s/it] {'loss': 1.5498, 'learning_rate': 6.792452830188679e-06, 'epoch': 0.01}
[2025-11-06 17:47:50,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.44 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 17:47:51,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 17:47:51,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.79 | bwd_microstep: 839.59 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 838.26 | step_microstep: 2.38
[2025-11-06 17:47:51,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.26 | bwd: 840.53 | bwd_inner: 2.11 | bwd_allreduce: 838.29 | step: 2.46
1%| | 37/3507 [03:05<1:10:28, 1.22s/it] {'loss': 1.5742, 'learning_rate': 6.981132075471699e-06, 'epoch': 0.01}
[2025-11-06 17:47:51,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.29 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
[2025-11-06 17:47:52,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.20 | optimizer_step: 0.19
[2025-11-06 17:47:52,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.35 | bwd_microstep: 2.02 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.87 | step_microstep: 1.92
[2025-11-06 17:47:52,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.65 | bwd: 2.93 | bwd_inner: 1.85 | bwd_allreduce: 0.91 | step: 2.02
1%| | 38/3507 [03:06<1:08:09, 1.18s/it] {'loss': 1.5908, 'learning_rate': 7.169811320754717e-06, 'epoch': 0.01}
[2025-11-06 17:47:52,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 216.22 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
[2025-11-06 17:47:54,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 17:47:54,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.41 | bwd_microstep: 1159.80 | bwd_inner_microstep: 1.39 | bwd_allreduce_microstep: 1158.32 | step_microstep: 1.60
[2025-11-06 17:47:54,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.63 | bwd: 1160.78 | bwd_inner: 2.31 | bwd_allreduce: 1158.36 | step: 1.67
1%| | 39/3507 [03:08<1:14:44, 1.29s/it] {'loss': 1.5781, 'learning_rate': 7.358490566037736e-06, 'epoch': 0.01}
[2025-11-06 17:47:54,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 118.06 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.18
device='cuda:3') tensor([[0.4121, 0.4355, 0.4375, 0.4316, 0.4141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:47:55,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.21 [2025-11-06 17:47:55,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.95 | bwd_microstep: 1.93 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.92 | step_microstep: 1.69 [2025-11-06 17:47:55,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.02 | bwd: 2.96 | bwd_inner: 1.85 | bwd_allreduce: 0.96 | step: 1.87 1%| | 40/3507 [03:09<1:20:59, 1.40s/it] {'loss': 1.5869, 'learning_rate': 7.5471698113207555e-06, 'epoch': 0.01} 1%| | 40/3507 [03:09<1:20:59, 1.40s/it]tensor([[-0.2012, -0.2080, -0.1514, -0.1494, -0.1953]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.9648, -1.0000, -0.8750, -0.8789, -0.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[0.3750, 0.3906, 0.3848, 0.4082, 0.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:47:56,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.46 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[0.7383, 0.7734, 0.7500, 0.7461, 0.7383]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.6484, -0.6836, -0.6016, -0.5703, -0.6484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.9727, -1.0078, -0.8867, -0.8828, -0.9609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[1.1328, 1.1719, 1.1172, 1.1250, 1.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.0557, 0.0593, 0.0840, 0.0977, 0.0593]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:47:56,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 17:47:56,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.11 | bwd_microstep: 275.97 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 275.00 | step_microstep: 2.00 [2025-11-06 17:47:56,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.58 | bwd: 276.95 | bwd_inner: 1.76 | bwd_allreduce: 275.06 | step: 2.09 1%| | 41/3507 [03:10<1:08:21, 1.18s/it] {'loss': 1.5801, 'learning_rate': 7.735849056603775e-06, 'epoch': 0.01} 1%| | 41/3507 [03:10<1:08:21, 1.18s/it]tensor([[-0.2256, -0.2334, -0.1650, -0.1748, -0.2129]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[0.5078, 0.5312, 0.5430, 0.5195, 0.5117]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.0552, -0.0542, -0.0055, -0.0078, -0.0471]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.5625, 0.5938, 0.5938, 0.5703, 0.5664]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.2314, -0.2354, -0.2012, -0.1758, -0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:47:57,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.92 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.0703, -1.1094, -0.9766, -0.9648, -1.0547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.2412, -0.2461, -0.1787, -0.1924, -0.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.9023, -0.9375, -0.8242, -0.8086, -0.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:0') [2025-11-06 17:47:59,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 17:47:59,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.77 | bwd_microstep: 2.02 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.85 | step_microstep: 1.93 [2025-11-06 17:47:59,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 433.71 | bwd: 2.94 | bwd_inner: 1.93 | bwd_allreduce: 0.89 | step: 2.01 1%| | 42/3507 [03:12<1:31:12, 1.58s/it] {'loss': 1.583, 'learning_rate': 7.924528301886793e-06, 'epoch': 0.01} 1%| | 42/3507 [03:12<1:31:12, 1.58s/it]tensor([[-0.3027, -0.3164, -0.2363, -0.2363, -0.2949]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.1279, 0.1318, 0.1777, 0.1572, 0.1357]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.1758, -0.1875, -0.1367, -0.1133, -0.1738]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4219, -3.5625, -3.2188, -3.1875, -3.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:47:59,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.26 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-0.4355, -0.4551, -0.3750, -0.3672, -0.4297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.0156, -1.0625, -0.9375, -0.9102, -1.0078]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.7969, -0.8242, -0.7031, -0.7148, -0.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[0.0393, 0.0430, 0.0918, 0.0767, 0.0449]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:47:59,538] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.27 | optimizer_step: 0.23 [2025-11-06 17:47:59,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.51 | bwd_microstep: 16.20 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 15.14 | step_microstep: 2.68 [2025-11-06 17:47:59,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 345.78 | bwd: 17.15 | bwd_inner: 1.82 | bwd_allreduce: 15.18 | step: 2.75 1%| | 43/3507 [03:13<1:10:45, 1.23s/it] {'loss': 1.5557, 'learning_rate': 8.113207547169812e-06, 'epoch': 0.01} 1%| | 43/3507 [03:13<1:10:45, 1.23s/it]tensor([[-0.6953, -0.7227, -0.6367, -0.6055, -0.6914]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[1.4766, 1.5469, 1.4688, 1.4375, 1.4766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.1748, -0.1768, -0.1123, -0.1152, -0.1680]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.5195, -0.5430, -0.4707, -0.4434, -0.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.3867, -0.3965, -0.3125, -0.3184, -0.3770]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.0159, -0.0172, 0.0278, 0.0344, -0.0128]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:00,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.94 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[0.7148, 0.7500, 0.7422, 0.7188, 0.7148]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[0.5469, 0.5703, 0.5625, 0.5742, 0.5430]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:48:01,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.67 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 17:48:01,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 335.77 | bwd_microstep: 408.79 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 407.74 | step_microstep: 2.10 [2025-11-06 17:48:01,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 537.73 | bwd: 409.70 | bwd_inner: 1.74 | bwd_allreduce: 407.79 | step: 2.19 1%|▏ | 44/3507 [03:15<1:27:31, 1.52s/it] {'loss': 1.5791, 'learning_rate': 8.301886792452832e-06, 'epoch': 0.01} 1%|▏ | 44/3507 [03:15<1:27:31, 1.52s/it]tensor([[-0.2236, -0.2305, -0.1582, -0.1689, -0.2158]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:01,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.14 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[0.9961, 1.0391, 1.0078, 0.9883, 0.9961]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.5430, -0.5703, -0.4980, -0.4570, -0.5391]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.2891, -1.3438, -1.1719, -1.1719, -1.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.8125, -0.8438, -0.7461, -0.7188, -0.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.3281, -1.3828, -1.2422, -1.2031, -1.3203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.3887, -0.4043, -0.3301, -0.3086, -0.3828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.5859, -1.6562, -1.4844, -1.4453, -1.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:48:02,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | 
optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:48:02,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.00 | bwd_microstep: 112.39 | bwd_inner_microstep: 1.34 | bwd_allreduce_microstep: 110.98 | step_microstep: 1.93 [2025-11-06 17:48:02,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 388.15 | bwd: 113.36 | bwd_inner: 2.22 | bwd_allreduce: 111.02 | step: 2.01 1%|▏ | 45/3507 [03:16<1:10:38, 1.22s/it] {'loss': 1.5479, 'learning_rate': 8.49056603773585e-06, 'epoch': 0.01} 1%|▏ | 45/3507 [03:16<1:10:38, 1.22s/it]tensor([[-0.8320, -0.8633, -0.7305, -0.7383, -0.8164]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[0.0967, 0.1011, 0.1543, 0.1455, 0.1035]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.4258, -0.4395, -0.3457, -0.3516, -0.4160]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.0659, 0.0757, 0.1069, 0.1147, 0.0713]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.7812, -0.8047, -0.7148, -0.6797, -0.7734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:48:02,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.07 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-0.3867, -0.3984, -0.3418, -0.3047, -0.3828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7266, -1.8047, -1.5938, -1.5781, -1.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.2031, -2.2969, -2.0625, -2.0312, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:48:04,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.15 | optimizer_step: 0.16 
[2025-11-06 17:48:04,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.64 | bwd_microstep: 622.19 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 621.00 | step_microstep: 1.94 [2025-11-06 17:48:04,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.71 | bwd: 623.25 | bwd_inner: 2.07 | bwd_allreduce: 621.04 | step: 2.03 1%|▏ | 46/3507 [03:18<1:26:58, 1.51s/it] {'loss': 1.5469, 'learning_rate': 8.67924528301887e-06, 'epoch': 0.01} 1%|▏ | 46/3507 [03:18<1:26:58, 1.51s/it]tensor([[0.3652, 0.3789, 0.3887, 0.4004, 0.3633]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.1592, -0.1680, -0.1128, -0.0894, -0.1572]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.4258, 0.4414, 0.4707, 0.4648, 0.4297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:04,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.50 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-0.8555, -0.8867, -0.7422, -0.7578, -0.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.5781, -1.6406, -1.4375, -1.4531, -1.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.5312, -0.5469, -0.4551, -0.4395, -0.5234]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.4824, -0.5039, -0.4141, -0.4023, -0.4746]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[0.0835, 0.0830, 0.1245, 0.1387, 0.0854]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:48:04,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.22 | optimizer_step: 0.23 [2025-11-06 17:48:04,919] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.63 | bwd_microstep: 19.92 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 18.82 | step_microstep: 2.28 [2025-11-06 17:48:04,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 406.16 | bwd: 20.62 | bwd_inner: 1.60 | bwd_allreduce: 18.87 | step: 2.37 1%|▏ | 47/3507 [03:18<1:09:04, 1.20s/it] {'loss': 1.5576, 'learning_rate': 8.867924528301887e-06, 'epoch': 0.01} 1%|▏ | 47/3507 [03:18<1:09:04, 1.20s/it]tensor([[-1.0781, -1.1094, -0.9570, -0.9570, -1.0547]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.1797, -1.2266, -1.0781, -1.0547, -1.1641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.3008, 0.3145, 0.3379, 0.3438, 0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:05,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.44 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-0.1426, -0.1494, -0.0884, -0.0762, -0.1367]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.0625, -1.1094, -0.9648, -0.9414, -1.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.4297, -0.4434, -0.3691, -0.3359, -0.4238]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.4902, -0.5039, -0.4043, -0.4082, -0.4805]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.2441, 0.2578, 0.2969, 0.2930, 0.2480]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:48:07,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 17:48:07,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 170.22 | bwd_microstep: 1.98 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.40 [2025-11-06 17:48:07,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.68 | bwd: 2.87 | bwd_inner: 1.83 | bwd_allreduce: 0.89 | step: 2.49 1%|▏ | 48/3507 [03:21<1:29:10, 1.55s/it] {'loss': 1.5566, 'learning_rate': 9.056603773584907e-06, 'epoch': 0.01} 1%|▏ | 48/3507 [03:21<1:29:10, 1.55s/it]tensor([[-0.2334, -0.2412, -0.1543, -0.1582, -0.2266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9082e-02, -6.3477e-02, -2.9564e-05, 1.8921e-03, -5.6152e-02]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.9531, -0.9883, -0.8438, -0.8398, -0.9414]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.4180, 0.4336, 0.4414, 0.4531, 0.4180]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:07,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.87 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-0.1602, -0.1650, -0.1079, -0.0825, -0.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.8516, -1.9141, -1.6719, -1.6797, -1.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.4629, -0.4785, -0.4004, -0.3730, -0.4590]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.7188, -0.7461, -0.6172, -0.6133, -0.7070]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:48:07,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:48:07,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.99 | 
bwd_microstep: 113.00 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 111.70 | step_microstep: 1.44 [2025-11-06 17:48:07,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.89 | bwd: 113.99 | bwd_inner: 2.13 | bwd_allreduce: 111.74 | step: 1.52 1%|▏ | 49/3507 [03:21<1:11:32, 1.24s/it] {'loss': 1.5547, 'learning_rate': 9.245283018867926e-06, 'epoch': 0.01} 1%|▏ | 49/3507 [03:21<1:11:32, 1.24s/it]tensor([[-0.4043, -0.4219, -0.3184, -0.3105, -0.3965]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:48:07,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.88 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-0.1572, -0.1670, -0.1021, -0.0830, -0.1533]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.4629, -0.4766, -0.3652, -0.3750, -0.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.9375, -0.9727, -0.8164, -0.8281, -0.9258]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[0.2129, 0.2188, 0.2793, 0.2441, 0.2168]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.7812, -1.8438, -1.6328, -1.6016, -1.7578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.9102, -0.9531, -0.8125, -0.7852, -0.9023]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[1.2500, 1.2969, 1.2344, 1.2344, 1.2422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:48:09,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 17:48:09,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.82 | bwd_microstep: 463.72 | bwd_inner_microstep: 
0.97 | bwd_allreduce_microstep: 462.65 | step_microstep: 2.45 [2025-11-06 17:48:09,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.71 | bwd: 464.69 | bwd_inner: 1.85 | bwd_allreduce: 462.70 | step: 2.54 1%|▏ | 50/3507 [03:23<1:26:48, 1.51s/it] {'loss': 1.5508, 'learning_rate': 9.433962264150944e-06, 'epoch': 0.01} 1%|▏ | 50/3507 [03:23<1:26:48, 1.51s/it]tensor([[0.0272, 0.0317, 0.0918, 0.0796, 0.0320]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:10,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.93 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-0.0908, -0.0898, -0.0062, -0.0223, -0.0830]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[0.2100, 0.2178, 0.2578, 0.2578, 0.2139]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.0574, 0.0613, 0.1089, 0.1196, 0.0615]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.4473, 0.4590, 0.4922, 0.4863, 0.4492]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.6055, -0.6250, -0.5000, -0.5039, -0.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.3945, -0.4043, -0.2871, -0.3066, -0.3828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.0991, -0.1069, -0.0270, -0.0247, -0.0947]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:48:10,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 17:48:10,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.10 | bwd_microstep: 199.97 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 199.09 | step_microstep: 
1.33 [2025-11-06 17:48:10,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.06 | bwd: 201.08 | bwd_inner: 1.78 | bwd_allreduce: 199.14 | step: 1.44 1%|▏ | 51/3507 [03:24<1:10:51, 1.23s/it] {'loss': 1.5664, 'learning_rate': 9.622641509433963e-06, 'epoch': 0.01} 1%|▏ | 51/3507 [03:24<1:10:51, 1.23s/it]tensor([[-1.2109, -1.2500, -1.1016, -1.0547, -1.1953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.8789, 0.9141, 0.8867, 0.8789, 0.8789]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.7969, -0.8281, -0.6992, -0.6719, -0.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 17:48:10,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.01 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-0.4180, -0.4355, -0.3535, -0.3262, -0.4141]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.4180, -0.4395, -0.3457, -0.3164, -0.4141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3125, -3.4375, -3.0625, -3.0156, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.6875, -0.7070, -0.6016, -0.5586, -0.6797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.4453, -0.4629, -0.3379, -0.3555, -0.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:48:13,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 17:48:13,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.58 | bwd_microstep: 1261.07 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1259.97 | step_microstep: 1.78 [2025-11-06 17:48:13,113] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.59 | bwd: 1261.98 | bwd_inner: 1.84 | bwd_allreduce: 1260.01 | step: 1.86 1%|▏ | 52/3507 [03:26<1:34:18, 1.64s/it] {'loss': 1.54, 'learning_rate': 9.811320754716981e-06, 'epoch': 0.01} 1%|▏ | 52/3507 [03:26<1:34:18, 1.64s/it]tensor([[-0.3730, -0.3906, -0.2812, -0.2793, -0.3652]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:13,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.02 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[0.5586, 0.5742, 0.6055, 0.5898, 0.5586]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[3.9375, 4.0938, 3.7969, 3.7344, 3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.0547, -1.0938, -0.9453, -0.9141, -1.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[0.2334, 0.2393, 0.2715, 0.2949, 0.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.9453, -0.9805, -0.8203, -0.8086, -0.9336]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.2734, 0.2832, 0.3301, 0.3047, 0.2773]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.5117, -0.5312, -0.4004, -0.4102, -0.5039]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:48:13,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.01 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 17:48:13,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.80 | bwd_microstep: 259.39 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 258.21 | step_microstep: 2.60 [2025-11-06 17:48:13,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 275.84 
| bwd: 260.36 | bwd_inner: 1.95 | bwd_allreduce: 258.26 | step: 2.69
2%|▏ | 53/3507 [03:27<1:15:50, 1.32s/it] {'loss': 1.5723, 'learning_rate': 1e-05, 'epoch': 0.02}

tensor([[-0.8164, -0.8438, -0.6914, -0.6914, -0.8008]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.8359, -1.8984, -1.6250, -1.6484, -1.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.4316, -0.4531, -0.3516, -0.3262, -0.4277]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.0938, -1.1328, -0.9375, -0.9492, -1.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:14,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 319.82 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-1.0078, -1.0391, -0.8555, -0.8672, -0.9922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[0.6758, 0.6992, 0.7070, 0.7031, 0.6758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[0.5391, 0.5625, 0.5859, 0.5703, 0.5391]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.1465, -0.1543, -0.0859, -0.0525, -0.1436]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:48:15,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.25
[2025-11-06 17:48:15,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.74 | bwd_microstep: 1033.68 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1032.58 | step_microstep: 1.91
[2025-11-06 17:48:15,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 443.58 | bwd: 1034.61 | bwd_inner: 1.87 | bwd_allreduce: 1032.61 | step: 1.98
2%|▏ | 54/3507 [03:29<1:19:17, 1.38s/it] {'loss': 1.541, 'learning_rate': 1.018867924528302e-05, 'epoch': 0.02}

tensor([[0.9922, 1.0312, 1.0000, 1.0000, 0.9883]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[0.1582, 0.1631, 0.2295, 0.2266, 0.1611]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[0.0459, 0.0486, 0.1289, 0.1152, 0.0510]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.7539, -0.7852, -0.6289, -0.6250, -0.7461]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:15,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.24 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-0.8594, -0.8906, -0.7383, -0.7266, -0.8477]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.2363, -0.2500, -0.1768, -0.1367, -0.2393]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.9961, -1.0312, -0.8516, -0.8477, -0.9805]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.4766, -1.5312, -1.2969, -1.3047, -1.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:15,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 17:48:15,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.51 | bwd_microstep: 1.86 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.10
[2025-11-06 17:48:15,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 490.77 | bwd: 2.72 | bwd_inner: 1.77 | bwd_allreduce: 0.83 | step: 2.19
2%|▏ | 55/3507 [03:29<1:06:28, 1.16s/it] {'loss': 1.5439, 'learning_rate': 1.0377358490566038e-05, 'epoch': 0.02}

tensor([[-2.2188, -2.3125, -1.9844, -1.9844, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.8125, -2.9219, -2.5625, -2.5156, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.4277, -0.4453, -0.3398, -0.3184, -0.4238]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:48:16,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.56 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.4062, -0.4219, -0.3008, -0.3066, -0.4004]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.7070, -0.7266, -0.6133, -0.5703, -0.6953]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.3516, -0.3633, -0.2471, -0.2559, -0.3418]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[0.0177, 0.0208, 0.1123, 0.0957, 0.0242]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.7969, -0.8242, -0.6562, -0.6719, -0.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:17,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.16 | optimizer_step: 0.21
[2025-11-06 17:48:17,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.64 | bwd_microstep: 2.17 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 0.91 | step_microstep: 2.00
[2025-11-06 17:48:17,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.23 | bwd: 3.05 | bwd_inner: 1.96 | bwd_allreduce: 0.95 | step: 2.08
2%|▏ | 56/3507 [03:30<1:09:21, 1.21s/it] {'loss': 1.5156, 'learning_rate': 1.0566037735849058e-05, 'epoch': 0.02}

tensor([[-0.4980, -0.5156, -0.3770, -0.3926, -0.4824]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.5625, -2.6562, -2.3281, -2.2812, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.5117, -0.5273, -0.3652, -0.4160, -0.4941]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[0.6172, 0.6406, 0.6484, 0.6562, 0.6133]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:48:17,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.79 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12
tensor([[0.5820, 0.6055, 0.6484, 0.6133, 0.5898]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[0.4082, 0.4258, 0.4766, 0.4414, 0.4160]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:0')
tensor([[-0.0342, -0.0315, 0.0496, 0.0277, -0.0262]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.7148, -0.7383, -0.5664, -0.5898, -0.6992]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:48:17,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 17:48:17,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.78 | bwd_microstep: 57.33 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 56.29 | step_microstep: 1.42
[2025-11-06 17:48:17,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.61 | bwd: 58.42 | bwd_inner: 1.89 | bwd_allreduce: 56.35 | step: 1.54
2%|▏ | 57/3507 [03:31<56:15, 1.02it/s] {'loss': 1.542, 'learning_rate': 1.0754716981132076e-05, 'epoch': 0.02}

tensor([[-0.5078, -0.5234, -0.3945, -0.3906, -0.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:48:17,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.31 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-0.0835, -0.0815, 0.0302, -0.0053, -0.0742]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[0.1260, 0.1299, 0.2041, 0.1777, 0.1279]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[0.2344, 0.2480, 0.3223, 0.2852, 0.2412]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.1719, -1.2109, -1.0391, -0.9844, -1.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-1.9688, -2.0469, -1.7656, -1.7266, -1.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.8945, -0.9297, -0.7773, -0.7422, -0.8867]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-0.5703, -0.5898, -0.4590, -0.4492, -0.5586]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:48:19,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:48:19,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.46 | bwd_microstep: 11.40 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 10.07 | step_microstep: 2.04
[2025-11-06 17:48:19,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 282.79 | bwd: 12.23 | bwd_inner: 2.02 | bwd_allreduce: 10.09 | step: 2.10
2%|▏ | 58/3507 [03:33<1:13:06, 1.27s/it] {'loss': 1.543, 'learning_rate': 1.0943396226415095e-05, 'epoch': 0.02}

tensor([[-0.3984, -0.4180, -0.2773, -0.2773, -0.3926]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[3.3281, 3.4531, 3.1875, 3.1250, 3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.5859, -1.6484, -1.3906, -1.3828, -1.5703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:19,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.54 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.6836, -0.7031, -0.5195, -0.5625, -0.6641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.1787, -0.1846, -0.1094, -0.0645, -0.1777]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.4219, -1.4766, -1.2422, -1.2266, -1.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.2207, -0.2266, -0.1128, -0.1328, -0.2139]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[0.1885, 0.1914, 0.2871, 0.2539, 0.1953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 17:48:20,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 17:48:20,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.13 | bwd_microstep: 456.69 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 455.53 | step_microstep: 1.55
[2025-11-06 17:48:20,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 417.70 | bwd: 457.54 | bwd_inner: 1.85 | bwd_allreduce: 455.56 | step: 1.64
2%|▏ | 59/3507 [03:34<1:12:02, 1.25s/it] {'loss': 1.5518, 'learning_rate': 1.1132075471698115e-05, 'epoch': 0.02}

tensor([[0.3613, 0.3711, 0.4160, 0.4199, 0.3574]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[0.6875, 0.7109, 0.7031, 0.7148, 0.6836]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.3047, -0.3184, -0.2002, -0.1895, -0.2988]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.8047, -0.8281, -0.6289, -0.6602, -0.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:21,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.62 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.08
tensor([[-0.4180, -0.4375, -0.3047, -0.2891, -0.4102]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[0.8906, 0.9180, 0.9375, 0.8945, 0.8867]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[0.0801, 0.0806, 0.1455, 0.1631, 0.0806]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.7031, -0.7305, -0.5547, -0.5586, -0.6914]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:22,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 17:48:22,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.22 | bwd_microstep: 1.89 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.81 | step_microstep: 1.81
[2025-11-06 17:48:22,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 476.82 | bwd: 2.95 | bwd_inner: 1.99 | bwd_allreduce: 0.83 | step: 1.89
2%|▏ | 60/3507 [03:36<1:24:14, 1.47s/it] {'loss': 1.5518, 'learning_rate': 1.1320754716981132e-05, 'epoch': 0.02}

tensor([[-1.0938, -1.1328, -0.8945, -0.9180, -1.0703]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[0.1953, 0.2021, 0.2676, 0.2715, 0.1963]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[0.3848, 0.4004, 0.4238, 0.4512, 0.3828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:48:22,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.46 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.4219, -2.5000, -2.1250, -2.1094, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.1758, -0.1846, -0.0967, -0.0649, -0.1748]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.3535, -0.3672, -0.2520, -0.2402, -0.3457]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.2891, -1.3359, -1.0781, -1.0938, -1.2734]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[0.3398, 0.3496, 0.4277, 0.3965, 0.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:48:23,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 17:48:23,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.19 | bwd_microstep: 111.58 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 110.29 | step_microstep: 1.46
[2025-11-06 17:48:23,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.67 | bwd: 112.45 | bwd_inner: 1.97 | bwd_allreduce: 110.33 | step: 1.54
2%|▏ | 61/3507 [03:37<1:07:02, 1.17s/it] {'loss': 1.5195, 'learning_rate': 1.1509433962264152e-05, 'epoch': 0.02}

tensor([[-0.4707, -0.4824, -0.3320, -0.3633, -0.4590]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:23,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 67.20 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[0.0417, 0.0466, 0.1611, 0.1060, 0.0505]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[0.3809, 0.3926, 0.4219, 0.4395, 0.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.1846, -0.1904, -0.0713, -0.0894, -0.1807]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[0.3848, 0.3965, 0.4648, 0.4277, 0.3867]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.4961, -0.5156, -0.3457, -0.3828, -0.4805]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.1924, -0.2002, -0.1196, -0.0859, -0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-0.4434, -0.4648, -0.3164, -0.3105, -0.4355]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:48:25,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 17:48:25,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.94 | bwd_microstep: 294.07 | bwd_inner_microstep: 1.32 | bwd_allreduce_microstep: 292.64 | step_microstep: 2.00
[2025-11-06 17:48:25,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 272.16 | bwd: 294.94 | bwd_inner: 2.13 | bwd_allreduce: 292.66 | step: 2.07
2%|▏ | 62/3507 [03:38<1:20:27, 1.40s/it] {'loss': 1.5381, 'learning_rate': 1.169811320754717e-05, 'epoch': 0.02}

tensor([[0.2734, 0.2793, 0.3320, 0.3496, 0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:48:25,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.23 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[0.8086, 0.8359, 0.8672, 0.8242, 0.8047]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.0221, -0.0258, 0.0688, 0.0776, -0.0216]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.7109, -1.7734, -1.4688, -1.4766, -1.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.1328, -1.1641, -0.9297, -0.9609, -1.1016]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[0.0114, 0.0124, 0.1396, 0.0879, 0.0189]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[1.8672, 1.9375, 1.8203, 1.7812, 1.8516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.8906, -0.9219, -0.7500, -0.7070, -0.8789]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 17:48:25,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 17:48:25,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.89 | bwd_microstep: 72.48 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 71.42 | step_microstep: 1.76
[2025-11-06 17:48:25,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.14 | bwd: 73.34 | bwd_inner: 1.77 | bwd_allreduce: 71.45 | step: 1.84
2%|▏ | 63/3507 [03:39<1:03:19, 1.10s/it] {'loss': 1.54, 'learning_rate': 1.188679245283019e-05, 'epoch': 0.02}

tensor([[-1.2344, -1.2734, -1.0391, -1.0234, -1.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:48:25,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.50 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[1.5469, 1.6016, 1.5469, 1.4688, 1.5391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.0200, -0.0242, 0.0674, 0.0859, -0.0179]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.6562, -1.7109, -1.4141, -1.4141, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.1514, -0.1543, -0.0047, -0.0518, -0.1416]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[0.9180, 0.9492, 0.9453, 0.9062, 0.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.0388, -0.0466, 0.0708, 0.0615, -0.0391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[0.0118, 0.0119, 0.0879, 0.1128, 0.0117]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:48:27,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 17:48:27,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.67 | bwd_microstep: 2.36 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 0.93 | step_microstep: 1.99
[2025-11-06 17:48:27,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.19 | bwd: 3.43 | bwd_inner: 2.30 | bwd_allreduce: 0.98 | step: 2.08
2%|▏ | 64/3507 [03:40<1:10:14, 1.22s/it] {'loss': 1.543, 'learning_rate': 1.2075471698113209e-05, 'epoch': 0.02}

tensor([[-0.6133, -0.6406, -0.4902, -0.4512, -0.6055]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[0.2275, 0.2363, 0.3242, 0.2871, 0.2334]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.8398, -0.8711, -0.6484, -0.6758, -0.8203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:27,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.34 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.5430, -0.5664, -0.4316, -0.3926, -0.5391]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.9492, -0.9844, -0.7461, -0.7734, -0.9336]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.8281, -1.8984, -1.5781, -1.5391, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.1396, -0.1426, -0.0008, -0.0223, -0.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.2578, -1.3047, -1.0469, -1.0234, -1.2422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 17:48:27,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 17:48:27,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.47 | bwd_microstep: 81.46 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 80.29 | step_microstep: 1.46
[2025-11-06 17:48:27,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 401.83 | bwd: 82.37 | bwd_inner: 1.93 | bwd_allreduce: 80.32 | step: 1.55
2%|▏ | 65/3507 [03:41<58:07, 1.01s/it] {'loss': 1.501, 'learning_rate': 1.2264150943396227e-05, 'epoch': 0.02}

tensor([[-1.1328, -1.1719, -0.8984, -0.9258, -1.1172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.2441, -0.2539, -0.1030, -0.1543, -0.2314]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.1582, -0.1680, -0.0635, -0.0266, -0.1533]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:48:27,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.31 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[0.6914, 0.7188, 0.7578, 0.6914, 0.6914]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.9609, -2.0312, -1.7109, -1.6641, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-0.3770, -0.3887, -0.2021, -0.2539, -0.3633]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.1875, -1.2266, -0.9688, -0.9688, -1.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.5898, -0.6133, -0.4277, -0.4297, -0.5820]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:48:29,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.71 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 17:48:29,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.65 | bwd_microstep: 824.48 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 823.24 | step_microstep: 2.27
[2025-11-06 17:48:29,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.99 | bwd: 825.34 | bwd_inner: 1.93 | bwd_allreduce: 823.27 | step: 2.35
2%|▏ | 66/3507 [03:43<1:18:02, 1.36s/it] {'loss': 1.498, 'learning_rate': 1.2452830188679246e-05, 'epoch': 0.02}

tensor([[-0.7148, -0.7383, -0.5312, -0.5469, -0.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.4141, -1.4766, -1.1797, -1.1641, -1.3984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.0286, -0.0266, 0.0986, 0.0771, -0.0245]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:48:30,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.65 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-0.8867, -0.9180, -0.7422, -0.6875, -0.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[0.3105, 0.3164, 0.3887, 0.3867, 0.3086]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.1680, -0.1826, -0.0620, -0.0447, -0.1680]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.2969, -1.3438, -1.0859, -1.0547, -1.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-0.5078, -0.5234, -0.3516, -0.3574, -0.4961]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:48:30,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 17:48:30,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.09 | bwd_microstep: 15.11 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 13.93 | step_microstep: 2.18
[2025-11-06 17:48:30,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 419.76 | bwd: 16.17 | bwd_inner: 2.08 | bwd_allreduce: 13.96 | step: 2.27
2%|▏ | 67/3507 [03:44<1:02:48, 1.10s/it] {'loss': 1.5078, 'learning_rate': 1.2641509433962264e-05, 'epoch': 0.02}

tensor([[-0.8008, -0.8242, -0.5703, -0.6367, -0.7773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:30,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.48 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[0.2871, 0.2949, 0.4082, 0.3418, 0.2949]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[0.1680, 0.1670, 0.2637, 0.2559, 0.1689]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[0.2051, 0.2148, 0.2949, 0.2871, 0.2070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-0.3164, -0.3281, -0.1768, -0.2100, -0.3047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[h264 @ 0x9207380] mmco: unref short failure
tensor([[0.6680, 0.6875, 0.7305, 0.6836, 0.6641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.0078, -1.0469, -0.8477, -0.7891, -1.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.5508, -0.5664, -0.3691, -0.3926, -0.5391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:48:32,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.16 | optimizer_step: 0.20
[2025-11-06 17:48:32,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.57 | bwd_microstep: 2387.03 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 2385.60 | step_microstep: 1.74
[2025-11-06 17:48:32,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 312.08 | bwd: 2388.02 | bwd_inner: 2.20 | bwd_allreduce: 2385.66 | step: 1.84
2%|▏ | 68/3507 [03:46<1:31:03, 1.59s/it] {'loss': 1.5371, 'learning_rate': 1.2830188679245283e-05, 'epoch': 0.02}

tensor([[0.0410, 0.0437, 0.1914, 0.1377, 0.0479]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:48:33,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.57 | bwd_microstep: 1.63 | bwd_inner_microstep: 1.52 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-0.0250, -0.0275, 0.1035, 0.0801, -0.0205]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.3027, -0.3105, -0.2031, -0.1660, -0.3027]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.2012, -0.2119, -0.0540, -0.0767, -0.1963]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.3145, -0.3262, -0.1338, -0.1895, -0.3027]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[0.3516, 0.3633, 0.4082, 0.4336, 0.3418]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.4297, -0.4473, -0.2734, -0.2734, -0.4238]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[0.6680, 0.6914, 0.6875, 0.7109, 0.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:48:33,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.14
[2025-11-06 17:48:33,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.86 | bwd_microstep: 97.17 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 96.00 | step_microstep: 1.38
[2025-11-06 17:48:33,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 312.46 | bwd: 98.79 | bwd_inner: 2.62 | bwd_allreduce: 96.04 | step: 1.46
2%|▏ | 69/3507 [03:47<1:11:19, 1.24s/it] {'loss': 1.5342, 'learning_rate': 1.3018867924528303e-05, 'epoch': 0.02}

tensor([[-0.0115, -0.0165, 0.1108, 0.1084, -0.0112]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.5156, -2.6094, -2.1875, -2.1094, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
[2025-11-06 17:48:33,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.37 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[0.1079, 0.1060, 0.2158, 0.2207, 0.1050]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[0.3066, 0.3184, 0.4375, 0.3574, 0.3105]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.1221, -0.1260, 0.0084, -0.0119, -0.1152]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.8359, -0.8750, -0.6250, -0.6211, -0.8242]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-0.7422, -0.7695, -0.5352, -0.5547, -0.7266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.8750, -0.9102, -0.6836, -0.6641, -0.8633]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 17:48:34,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.19 | optimizer_step: 0.26
[2025-11-06 17:48:34,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.15 | bwd_microstep: 612.99 | bwd_inner_microstep: 1.48 | bwd_allreduce_microstep: 611.43 | step_microstep: 1.89
[2025-11-06 17:48:34,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.54 | bwd: 613.89 | bwd_inner: 2.30 | bwd_allreduce: 611.47 | step: 1.97
2%|▏ | 70/3507 [03:48<1:07:38, 1.18s/it] {'loss': 1.541, 'learning_rate': 1.320754716981132e-05, 'epoch': 0.02}

tensor([[-0.9883, -1.0234, -0.7422, -0.7734, -0.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[0.0398, 0.0420, 0.2061, 0.1387, 0.0486]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:34,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.63 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[0.3125, 0.3242, 0.4434, 0.3613, 0.3184]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.8867, -0.9258, -0.6914, -0.6523, -0.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.6328, -0.6523, -0.4082, -0.4707, -0.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:0')
tensor([[-0.5859, -0.6133, -0.4082, -0.3965, -0.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.2812, -0.2891, -0.0854, -0.1611, -0.2676]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[1.3047, 1.3516, 1.2812, 1.2812, 1.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:48:35,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:48:35,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 80.85 | bwd_microstep: 495.58 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 494.55 | step_microstep: 1.62
[2025-11-06 17:48:35,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.50 | bwd: 496.60 | bwd_inner: 1.89 | bwd_allreduce: 494.59 | step: 1.71
2%|▏ | 71/3507 [03:49<1:01:41, 1.08s/it] {'loss': 1.5283, 'learning_rate': 1.339622641509434e-05, 'epoch': 0.02}

tensor([[-0.0669, -0.0698, 0.0249, 0.0649, -0.0708]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-0.9727, -1.0000, -0.8008, -0.7266, -0.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.7266, -0.7500, -0.4805, -0.5391, -0.7070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:35,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.78 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-0.5312, -0.5469, -0.3965, -0.3613, -0.5273]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.1279, -0.1318, 0.0386, 0.0091, -0.1226]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.3281, -1.3750, -1.1094, -1.0469, -1.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-0.1494, -0.1602, -0.0039, -0.0201, -0.1475]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.9453, -0.9766, -0.6914, -0.7539, -0.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 17:48:38,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 17:48:38,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.98 | bwd_microstep: 2816.67 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 2815.54 | step_microstep: 1.96
[2025-11-06 17:48:38,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.79 | bwd: 2817.45 | bwd_inner: 1.74 | bwd_allreduce: 2815.58 | step: 2.04
2%|▏ | 72/3507 [03:52<1:37:57, 1.71s/it] {'loss': 1.501, 'learning_rate': 1.3584905660377358e-05, 'epoch': 0.02}

tensor([[-0.7461, -0.7773, -0.5586, -0.5547, -0.7383]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:38,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.23 | bwd_microstep: 1.27 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.4141, -0.4238, -0.2090, -0.2598, -0.4004]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[0.3457, 0.3555, 0.4277, 0.4297, 0.3398]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.7422, -0.7773, -0.5742, -0.5195, -0.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[0.1157, 0.1230, 0.2520, 0.1807, 0.1226]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.8242, -0.8516, -0.5703, -0.6367, -0.8008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-0.3516, -0.3613, -0.1436, -0.2178, -0.3359]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[0.0645, 0.0603, 0.1875, 0.1982, 0.0608]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 17:48:38,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 17:48:38,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.93 | bwd_microstep: 63.96 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 62.78 | step_microstep: 1.44
[2025-11-06 17:48:38,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.18 | bwd: 65.24 | bwd_inner: 2.31 | bwd_allreduce: 62.81 | step: 1.52
2%|▏ | 73/3507 [03:52<1:16:55, 1.34s/it] {'loss': 1.498, 'learning_rate': 1.3773584905660378e-05, 'epoch': 0.02}

tensor([[0.1050, 0.1108, 0.2578, 0.1992, 0.1113]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.6367, -0.6641, -0.4512, -0.4336, -0.6289]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.4922, -1.5469, -1.1797, -1.2109, -1.4609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-1.6328, -1.6875, -1.2969, -1.3203, -1.6016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.9219, -0.9570, -0.7383, -0.6797, -0.9102]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[0.1523, 0.1572, 0.2949, 0.2217, 0.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:40,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.41 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.4473, -0.4688, -0.3145, -0.2656, -0.4434]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.4375, -1.4844, -1.1250, -1.1328, -1.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 17:48:41,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:48:41,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.32 | bwd_microstep: 224.74 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 223.72 | step_microstep: 1.98
[2025-11-06 17:48:41,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 435.75 | bwd: 225.71 | bwd_inner: 1.82 | bwd_allreduce: 223.76 | step: 2.06
2%|▏ | 74/3507 [03:55<1:33:48, 1.64s/it] {'loss': 1.5137, 'learning_rate': 1.3962264150943397e-05, 'epoch': 0.02}

tensor([[0.6953, 0.7188, 0.7969, 0.7188, 0.6914]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:48:41,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.64 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-1.3906, -1.4375, -1.1250, -1.0938, -1.3672]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[0.3262, 0.3359, 0.4336, 0.4199, 0.3203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[0.7070, 0.7305, 0.8164, 0.7422, 0.6992]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.8906, -0.9258, -0.7109, -0.6406,
-0.8789]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.0505, 0.0503, 0.1973, 0.1299, 0.0549]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-0.4688, -0.4863, -0.2578, -0.2832, -0.4590]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.0762, -0.0835, 0.0400, 0.0439, -0.0737]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:48:41,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 17:48:41,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.63 | bwd_microstep: 69.00 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 68.21 | step_microstep: 1.46 [2025-11-06 17:48:41,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.28 | bwd: 69.96 | bwd_inner: 1.60 | bwd_allreduce: 68.24 | step: 1.53 2%|▏ | 75/3507 [03:55<1:13:10, 1.28s/it] {'loss': 1.5244, 'learning_rate': 1.4150943396226415e-05, 'epoch': 0.02} 2%|▏ | 75/3507 [03:55<1:13:10, 1.28s/it]tensor([[-0.7852, -0.8164, -0.5352, -0.5781, -0.7695]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.3184, 0.3242, 0.4199, 0.4277, 0.3105]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3125, -2.3906, -1.9766, -1.8906, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.3203, -0.3340, -0.1924, -0.1348, -0.3164]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.4805, 0.4961, 0.5742, 0.5508, 0.4766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.0119, 0.0064, 0.1592, 0.1494, 0.0103]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:48:43,506] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.36 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[0.7344, 0.7539, 0.7773, 0.7812, 0.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.7383, -0.7656, -0.5469, -0.5195, -0.7227]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:48:44,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.22 | optimizer_step: 0.18 [2025-11-06 17:48:44,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.57 | bwd_microstep: 431.51 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 430.25 | step_microstep: 544.88 [2025-11-06 17:48:44,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.96 | bwd: 432.58 | bwd_inner: 2.10 | bwd_allreduce: 430.31 | step: 544.98 2%|▏ | 76/3507 [03:58<1:42:01, 1.78s/it] {'loss': 1.501, 'learning_rate': 1.4339622641509435e-05, 'epoch': 0.02} 2%|▏ | 76/3507 [03:58<1:42:01, 1.78s/it]tensor([[-1.0000, -1.0469, -0.7109, -0.7578, -0.9805]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.5117, -0.5312, -0.3066, -0.2988, -0.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.0044, -0.0062, 0.1387, 0.1118, -0.0019]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:44,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.99 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-1.5078, -1.5703, -1.2266, -1.1484, -1.4922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.0547, -1.1016, -0.8320, -0.7930, -1.0391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[0.3887, 0.4004, 0.4688, 0.4961, 0.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.0212, -0.0267, 0.1196, 0.0854, -0.0183]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.2129, -0.2197, 0.0161, -0.0879, -0.2002]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:48:45,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:48:45,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.36 | bwd_microstep: 147.37 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 146.08 | step_microstep: 1.87 [2025-11-06 17:48:45,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 319.38 | bwd: 148.11 | bwd_inner: 1.89 | bwd_allreduce: 146.11 | step: 1.94 2%|▏ | 77/3507 [03:59<1:19:57, 1.40s/it] {'loss': 1.4814, 'learning_rate': 1.4528301886792452e-05, 'epoch': 0.02} 2%|▏ | 77/3507 [03:59<1:19:57, 1.40s/it]tensor([[-2.3906, -2.4844, -2.0000, -1.9375, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.3145, 0.3320, 0.4609, 0.3809, 0.3203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.0703, -1.1094, -0.7891, -0.7695, -1.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.8203, -0.8516, -0.5508, -0.5898, -0.8008]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.0698, 0.0679, 0.1777, 0.1992, 0.0654]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.9570, -0.9883, -0.6680, -0.7266, -0.9336]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[0.1006, 0.1055, 0.2373, 0.2236, 0.0996]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') 
[2025-11-06 17:48:46,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.60 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[0.3105, 0.3262, 0.3848, 0.4180, 0.3066]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:48:47,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.13 | optimizer_step: 0.17 [2025-11-06 17:48:47,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.65 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 0.80 | step_microstep: 1.93 [2025-11-06 17:48:47,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.28 | bwd: 3.05 | bwd_inner: 2.10 | bwd_allreduce: 0.84 | step: 2.01 2%|▏ | 78/3507 [04:00<1:29:51, 1.57s/it] {'loss': 1.4717, 'learning_rate': 1.4716981132075472e-05, 'epoch': 0.02} 2%|▏ | 78/3507 [04:00<1:29:51, 1.57s/it]tensor([[-0.6641, -0.6914, -0.4688, -0.4121, -0.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.3027, -0.3164, -0.0894, -0.1221, -0.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.0952, -0.0986, 0.0713, 0.0708, -0.0928]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.4473, 0.4590, 0.5938, 0.5312, 0.4414]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:47,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.48 | bwd_microstep: 1.34 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-0.4141, -0.4336, -0.2715, -0.2002, -0.4141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.5156, -0.5352, -0.2383, -0.3320, -0.4980]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], 
device='cuda:2') tensor([[-0.2695, -0.2871, -0.0972, -0.0825, -0.2695]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.0598, -0.0688, 0.0859, 0.1040, -0.0596]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:48:47,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 17:48:47,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.31 | bwd_microstep: 1.96 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.68 [2025-11-06 17:48:47,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 514.82 | bwd: 3.30 | bwd_inner: 2.29 | bwd_allreduce: 0.89 | step: 2.78 2%|▏ | 79/3507 [04:01<1:12:30, 1.27s/it] {'loss': 1.5215, 'learning_rate': 1.4905660377358491e-05, 'epoch': 0.02} 2%|▏ | 79/3507 [04:01<1:12:30, 1.27s/it]tensor([[2.7812, 2.8906, 2.6406, 2.5156, 2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:47,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.81 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.4062, -2.5000, -2.0156, -1.9219, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.0894, 0.0884, 0.1885, 0.2314, 0.0830]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[3.7031, 3.8281, 3.5000, 3.3281, 3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.1494, 0.1543, 0.3418, 0.2471, 0.1514]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[0.1016, 0.1001, 0.2676, 0.2354, 0.0996]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.0320, 0.0288, 0.2100, 0.1768, 0.0330]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.1934, -0.2002, -0.0381, -0.0271, -0.1934]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:48:49,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.21 | optimizer_step: 0.21 [2025-11-06 17:48:49,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.17 | bwd_microstep: 1345.79 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1344.64 | step_microstep: 2.19 [2025-11-06 17:48:49,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 291.98 | bwd: 1346.79 | bwd_inner: 1.98 | bwd_allreduce: 1344.68 | step: 2.28 2%|▏ | 80/3507 [04:03<1:19:26, 1.39s/it] {'loss': 1.5303, 'learning_rate': 1.5094339622641511e-05, 'epoch': 0.02} 2%|▏ | 80/3507 [04:03<1:19:26, 1.39s/it]tensor([[1.7031, 1.7500, 1.6562, 1.6250, 1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7031, -3.8281, -3.1250, -3.0156, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.3281, -1.3750, -1.0156, -0.9883, -1.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:49,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.37 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.1196, -0.1289, 0.0583, 0.0549, -0.1226]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0625, -1.0938, -0.7383, -0.7852, -1.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.1289, 0.1260, 0.2676, 0.2734, 0.1191]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[6.7188, 6.9375, 6.1250, 5.9062, 6.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:2') tensor([[-1.2109, -1.2500, -0.8633, -0.8906, -1.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:48:49,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:48:49,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.85 | bwd_microstep: 79.73 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 78.66 | step_microstep: 1.55 [2025-11-06 17:48:49,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.24 | bwd: 80.52 | bwd_inner: 1.71 | bwd_allreduce: 78.69 | step: 1.62 2%|▏ | 81/3507 [04:03<1:04:02, 1.12s/it] {'loss': 1.5371, 'learning_rate': 1.5283018867924532e-05, 'epoch': 0.02} 2%|▏ | 81/3507 [04:03<1:04:02, 1.12s/it]tensor([[-0.6602, -0.6875, -0.3906, -0.4219, -0.6484]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6875, -3.8125, -3.0938, -3.0000, -3.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.1562, -1.1953, -0.8008, -0.8789, -1.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.0835, -0.0938, 0.1060, 0.0815, -0.0840]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.4258, -0.4473, -0.2090, -0.1982, -0.4238]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.5938, -1.6562, -1.3047, -1.2031, -1.5703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:48:50,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.37 | bwd_microstep: 1.10 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-0.4082, -0.4258, -0.2363, -0.1875, -0.4043]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') 
tensor([[0.4492, 0.4590, 0.5352, 0.5586, 0.4316]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:48:51,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.24 | optimizer_step: 0.22 [2025-11-06 17:48:51,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.29 | bwd_microstep: 1.88 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.43 [2025-11-06 17:48:51,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 390.68 | bwd: 2.98 | bwd_inner: 1.93 | bwd_allreduce: 0.90 | step: 2.51 2%|▏ | 82/3507 [04:05<1:19:09, 1.39s/it] {'loss': 1.4238, 'learning_rate': 1.547169811320755e-05, 'epoch': 0.02} 2%|▏ | 82/3507 [04:05<1:19:09, 1.39s/it]tensor([[-0.2734, -0.2891, -0.1025, -0.0684, -0.2734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.8203, -0.8438, -0.5391, -0.5703, -0.8008]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.2422, -0.2500, -0.0464, -0.0405, -0.2393]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.2656, 0.2695, 0.4336, 0.3828, 0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:52,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 257.20 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-0.4453, -0.4609, -0.2002, -0.2188, -0.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.1904, -0.1973, -0.0581, 0.0120, -0.1924]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.1895, -0.1953, 0.0752, -0.0315, -0.1777]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.3164, -0.3281, -0.1289, -0.1172, -0.3145]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:52,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.18 | optimizer_step: 0.23 [2025-11-06 17:48:52,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.54 | bwd_microstep: 2.10 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.93 | step_microstep: 2.15 [2025-11-06 17:48:52,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.78 | bwd: 2.98 | bwd_inner: 1.83 | bwd_allreduce: 0.97 | step: 2.26 2%|▏ | 83/3507 [04:06<1:03:00, 1.10s/it] {'loss': 1.4746, 'learning_rate': 1.5660377358490568e-05, 'epoch': 0.02} 2%|▏ | 83/3507 [04:06<1:03:00, 1.10s/it]tensor([[-0.3848, -0.3945, -0.0952, -0.1973, -0.3711]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:52,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.88 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[0.3340, 0.3496, 0.5156, 0.4121, 0.3379]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.6953, -1.7500, -1.3125, -1.2969, -1.6641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.0420, -0.0491, 0.1206, 0.1455, -0.0481]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.6055, -0.6250, -0.3086, -0.3770, -0.5898]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.1104, 0.1143, 0.2910, 0.2109, 0.1147]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.2275, -0.2373, -0.0737, -0.0142, -0.2275]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.0942, -0.0942, 0.1367, 0.0815, -0.0884]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:0') [2025-11-06 17:48:52,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.22 | optimizer_step: 0.24 [2025-11-06 17:48:52,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.50 | bwd_microstep: 1.87 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.79 | step_microstep: 1.96 [2025-11-06 17:48:52,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.40 | bwd: 2.80 | bwd_inner: 1.80 | bwd_allreduce: 0.83 | step: 2.06 2%|▏ | 84/3507 [04:06<50:57, 1.12it/s] {'loss': 1.4639, 'learning_rate': 1.5849056603773586e-05, 'epoch': 0.02} 2%|▏ | 84/3507 [04:06<50:57, 1.12it/s]tensor([[-0.5508, -0.5664, -0.2910, -0.3223, -0.5391]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.2969, -0.3027, -0.0527, -0.1147, -0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 17:48:52,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.02 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 tensor([[-0.0065, -0.0063, 0.2100, 0.1099, -0.0015]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.8594, -2.9531, -2.3750, -2.2656, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.3594, 0.3652, 0.4629, 0.4883, 0.3477]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.7500, 0.7773, 0.7852, 0.8203, 0.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.4453, -0.4629, -0.1895, -0.2256, -0.4355]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.3027, -0.3105, -0.1416, -0.0762, -0.3008]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:48:55,007] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.19 | optimizer_step: 0.26 [2025-11-06 17:48:55,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.26 | bwd_microstep: 1955.08 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1953.94 | step_microstep: 1.90 [2025-11-06 17:48:55,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 266.29 | bwd: 1956.14 | bwd_inner: 1.96 | bwd_allreduce: 1954.02 | step: 2.02 2%|▏ | 85/3507 [04:08<1:14:14, 1.30s/it] {'loss': 1.4863, 'learning_rate': 1.6037735849056607e-05, 'epoch': 0.02} 2%|▏ | 85/3507 [04:08<1:14:14, 1.30s/it]tensor([[-1.3906, -1.4375, -1.0078, -1.0156, -1.3672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:55,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.41 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[0.1318, 0.1387, 0.3418, 0.2598, 0.1338]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.7422, -0.7656, -0.4023, -0.4961, -0.7227]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.9609, -0.9883, -0.6055, -0.6641, -0.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.2188, -1.2500, -0.8516, -0.8438, -1.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.0469, -1.0703, -0.6719, -0.7578, -1.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-0.7500, -0.7773, -0.4902, -0.4434, -0.7422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.7031, -1.7578, -1.3359, -1.2656, -1.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:48:56,186] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:48:56,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.85 | bwd_microstep: 869.16 | bwd_inner_microstep: 1.53 | bwd_allreduce_microstep: 867.54 | step_microstep: 1.96 [2025-11-06 17:48:56,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 278.29 | bwd: 870.06 | bwd_inner: 2.36 | bwd_allreduce: 867.57 | step: 2.03 2%|▏ | 86/3507 [04:10<1:12:06, 1.26s/it] {'loss': 1.4512, 'learning_rate': 1.6226415094339625e-05, 'epoch': 0.02} 2%|▏ | 86/3507 [04:10<1:12:06, 1.26s/it]tensor([[0.1357, 0.1436, 0.2676, 0.3066, 0.1289]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.6797, 0.7031, 0.7617, 0.7539, 0.6641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:48:56,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.01 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.0312, -1.0625, -0.7891, -0.6875, -1.0234]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.1826, -0.1934, -0.0190, 0.0295, -0.1865]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.5586, -0.5742, -0.3730, -0.2773, -0.5547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.5859, -1.6406, -1.1641, -1.1719, -1.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.3945, -0.4121, -0.2012, -0.1465, -0.3965]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.1816, -0.1855, 0.0933, -0.0227, -0.1689]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 17:48:58,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | 
optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 17:48:58,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.53 | bwd_microstep: 1616.85 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1615.76 | step_microstep: 2.00 [2025-11-06 17:48:58,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 284.58 | bwd: 1617.72 | bwd_inner: 1.78 | bwd_allreduce: 1615.81 | step: 2.09 2%|▏ | 87/3507 [04:11<1:23:31, 1.47s/it] {'loss': 1.4883, 'learning_rate': 1.6415094339622643e-05, 'epoch': 0.02} 2%|▏ | 87/3507 [04:11<1:23:31, 1.47s/it]tensor([[-1.0625, -1.0859, -0.7617, -0.7070, -1.0391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.5195, -0.5273, -0.2227, -0.2734, -0.5078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:58,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.08 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[ 0.0011, -0.0054, 0.1914, 0.2051, -0.0078]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.7188, -0.7344, -0.3809, -0.4316, -0.6992]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[0.7070, 0.7305, 0.8320, 0.7539, 0.6953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[0.3730, 0.3867, 0.5547, 0.4727, 0.3652]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.6055, -0.6211, -0.3398, -0.3418, -0.5898]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.0344, -0.0317, 0.1582, 0.1201, -0.0352]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:48:58,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 
0.16 [2025-11-06 17:48:58,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 99.58 | bwd_microstep: 175.73 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 174.75 | step_microstep: 1.90 [2025-11-06 17:48:58,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 277.68 | bwd: 176.60 | bwd_inner: 1.70 | bwd_allreduce: 174.78 | step: 1.97 3%|▎ | 88/3507 [04:12<1:06:44, 1.17s/it] {'loss': 1.4707, 'learning_rate': 1.6603773584905664e-05, 'epoch': 0.03} 3%|▎ | 88/3507 [04:12<1:06:44, 1.17s/it]tensor([[-2.3594, -2.4219, -1.8594, -1.8047, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.0635, -0.0564, 0.2021, 0.1250, -0.0542]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[0.3633, 0.3691, 0.5117, 0.5078, 0.3496]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.7617, -0.7773, -0.4023, -0.4805, -0.7422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:48:58,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.18 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.4570, -0.4727, -0.2676, -0.1777, -0.4570]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.0388, -0.0361, 0.2080, 0.1416, -0.0349]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.6094, -0.6211, -0.3008, -0.3066, -0.6016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.3359, -1.3828, -1.0234, -0.9297, -1.3203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:49:00,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 17:49:00,811] [INFO] 
[interleaved per-rank tensor debug prints (bfloat16 logit slices and label tensors, cuda:0–cuda:3) omitted; their grad_fn names were already stripped from the capture]
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.36 | bwd_microstep: 1689.62 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 1688.65 | step_microstep: 1.84
[2025-11-06 17:49:00,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 474.57 | bwd: 1690.57 | bwd_inner: 1.76 | bwd_allreduce: 1688.68 | step: 1.91
  3%|▎ | 89/3507 [04:14<1:24:25, 1.48s/it] {'loss': 1.4697, 'learning_rate': 1.679245283018868e-05, 'epoch': 0.03}
[2025-11-06 17:49:01,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.57 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:49:01,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 17:49:01,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.50 | bwd_microstep: 66.79 | bwd_inner_microstep: 1.49 | bwd_allreduce_microstep: 65.22 | step_microstep: 1.55
[2025-11-06 17:49:01,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.09 | bwd: 67.77 | bwd_inner: 2.39 | bwd_allreduce: 65.25 | step: 1.64
  3%|▎ | 90/3507 [04:15<1:07:37, 1.19s/it] {'loss': 1.4512, 'learning_rate': 1.69811320754717e-05, 'epoch': 0.03}
[2025-11-06 17:49:01,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.43 | bwd_microstep: 1.13 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[2025-11-06 17:49:04,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.19 | optimizer_step: 0.17
[2025-11-06 17:49:04,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.43 | bwd_microstep: 1967.95 | bwd_inner_microstep: 3.87 | bwd_allreduce_microstep: 1963.98 | step_microstep: 1.93
[2025-11-06 17:49:04,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.89 | bwd: 1969.08 | bwd_inner: 4.90 | bwd_allreduce: 1964.02 | step: 2.02
  3%|▎ | 91/3507 [04:17<1:35:20, 1.67s/it] {'loss': 1.4648, 'learning_rate': 1.716981132075472e-05, 'epoch': 0.03}
[2025-11-06 17:49:04,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.44 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[2025-11-06 17:49:04,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 17:49:04,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.13 | bwd_microstep: 101.25 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 100.27 | step_microstep: 2.04
[2025-11-06 17:49:04,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.60 | bwd: 102.03 | bwd_inner: 1.54 | bwd_allreduce: 100.32 | step: 2.13
  3%|▎ | 92/3507 [04:18<1:16:12, 1.34s/it] {'loss': 1.418, 'learning_rate': 1.735849056603774e-05, 'epoch': 0.03}
[2025-11-06 17:49:04,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.23 | bwd_microstep: 0.63 | bwd_inner_microstep: 0.52 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:49:07,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.30
[2025-11-06 17:49:07,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.33 | bwd_microstep: 2729.50 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 2728.56 | step_microstep: 2.19
[2025-11-06 17:49:07,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.58 | bwd: 2730.13 | bwd_inner: 1.37 | bwd_allreduce: 2728.61 | step: 2.26
  3%|▎ | 93/3507 [04:21<1:47:09, 1.88s/it] {'loss': 1.5, 'learning_rate': 1.7547169811320756e-05, 'epoch': 0.03}
[2025-11-06 17:49:08,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.13 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 17:49:08,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 17:49:08,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.56 | bwd_microstep: 137.10 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 135.93 | step_microstep: 1.40
[2025-11-06 17:49:08,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.72 | bwd: 138.07 | bwd_inner: 1.99 | bwd_allreduce: 135.96 | step: 1.47
  3%|▎ | 94/3507 [04:22<1:24:02, 1.48s/it] {'loss': 1.417, 'learning_rate': 1.7735849056603774e-05, 'epoch': 0.03}
[2025-11-06 17:49:08,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.68 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:49:10,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.22 | optimizer_step: 0.27
[2025-11-06 17:49:10,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.36 | bwd_microstep: 2053.57 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 2052.35 | step_microstep: 2.37
[2025-11-06 17:49:10,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.07 | bwd: 2054.55 | bwd_inner: 2.01 | bwd_allreduce: 2052.40 | step: 2.46
  3%|▎ | 95/3507 [04:24<1:39:49, 1.76s/it] {'loss': 1.4209, 'learning_rate': 1.7924528301886795e-05, 'epoch': 0.03}
[2025-11-06 17:49:10,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.41 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:49:11,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.27 | optimizer_step: 0.20
[2025-11-06 17:49:11,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.86 | bwd_microstep: 167.30 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 166.25 | step_microstep: 2.08
[2025-11-06 17:49:11,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 305.30 | bwd: 168.19 | bwd_inner: 1.77 | bwd_allreduce: 166.28 | step: 2.17
  3%|▎ | 96/3507 [04:25<1:18:32, 1.38s/it] {'loss': 1.3867, 'learning_rate': 1.8113207547169813e-05, 'epoch': 0.03}
[2025-11-06 17:49:11,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.76 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 17:49:12,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:49:12,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 281.08 | bwd_microstep: 1225.16 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1224.04 | step_microstep: 1.60
[2025-11-06 17:49:12,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 423.87 | bwd: 1225.94 | bwd_inner: 1.72 | bwd_allreduce: 1224.08 | step: 1.67
  3%|▎ | 97/3507 [04:26<1:23:44, 1.47s/it] {'loss': 1.3896, 'learning_rate': 1.830188679245283e-05, 'epoch': 0.03}
[2025-11-06 17:49:13,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.57 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
[2025-11-06 17:49:13,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:49:13,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.96 | bwd_microstep: 2.16 | bwd_inner_microstep: 1.45 | bwd_allreduce_microstep: 0.64 | step_microstep: 1.50
[2025-11-06 17:49:13,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.56 | bwd: 3.04 | bwd_inner: 2.24 | bwd_allreduce: 0.68 | step: 1.59
  3%|▎ | 98/3507 [04:27<1:05:01, 1.14s/it] {'loss': 1.3818, 'learning_rate': 1.8490566037735852e-05, 'epoch': 0.03}
[2025-11-06 17:49:13,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.66 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[2025-11-06 17:49:15,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 17:49:15,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.20 | bwd_microstep: 2034.97 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 2033.95 | step_microstep: 1.71
[2025-11-06 17:49:15,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.88 | bwd: 2035.93 | bwd_inner: 1.79 | bwd_allreduce: 2034.00 | step: 1.80
  3%|▎ | 99/3507 [04:29<1:27:20, 1.54s/it] {'loss': 1.4102, 'learning_rate': 1.867924528301887e-05, 'epoch': 0.03}
[2025-11-06 17:49:16,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.01 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[2025-11-06 17:49:16,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 17:49:16,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.32 | bwd_microstep: 2.41 | bwd_inner_microstep: 1.52 | bwd_allreduce_microstep: 0.81 | step_microstep: 1.90
[2025-11-06 17:49:16,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 425.35 | bwd: 3.42 | bwd_inner: 2.45 | bwd_allreduce: 0.85 | step: 1.99
  3%|▎ | 100/3507 [04:30<1:09:05, 1.22s/it] {'loss': 1.4316, 'learning_rate': 1.8867924528301888e-05, 'epoch': 0.03}
[2025-11-06 17:49:16,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.11 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:49:16,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.20 | optimizer_step: 0.18
[2025-11-06 17:49:16,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.80 | bwd_microstep: 2.02 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.88 | step_microstep: 2.15
[2025-11-06 17:49:16,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.92 | bwd: 2.84 | bwd_inner: 1.78 | bwd_allreduce: 0.91 | step: 2.23
  3%|▎ | 101/3507 [04:30<1:00:38, 1.07s/it] {'loss': 1.3984, 'learning_rate': 1.905660377358491e-05, 'epoch': 0.03}
[2025-11-06 17:49:18,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 199.72 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[2025-11-06 17:49:18,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 17:49:18,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.47 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.93 | step_microstep: 1.96
[2025-11-06 17:49:18,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 397.18 | bwd: 3.07 | bwd_inner: 1.98 | bwd_allreduce: 0.97 | step: 2.05
  3%|▎ | 102/3507 [04:32<1:10:18, 1.24s/it] {'loss': 1.4209, 'learning_rate': 1.9245283018867927e-05, 'epoch': 0.03}
[2025-11-06 17:49:18,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.56 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:49:20,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.16 | optimizer_step: 0.19
[2025-11-06 17:49:20,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.27 | bwd_microstep: 1120.47 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1119.37 | step_microstep: 1.69
[2025-11-06 17:49:20,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.85 | bwd: 1121.41 | bwd_inner: 1.85 | bwd_allreduce: 1119.42 | step: 1.77
  3%|▎ | 103/3507 [04:33<1:14:12, 1.31s/it] {'loss': 1.3877, 'learning_rate': 1.9433962264150945e-05, 'epoch': 0.03}
[2025-11-06 17:49:20,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.83 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
[2025-11-06 17:49:20,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.18 | optimizer_step: 0.25
[2025-11-06 17:49:21,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.28 | bwd_microstep: 554.85 | bwd_inner_microstep: 1.36 | bwd_allreduce_microstep: 553.40 | step_microstep: 1.87
[2025-11-06 17:49:21,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 261.13 | bwd: 555.65 | bwd_inner: 2.09 | bwd_allreduce: 553.43 | step: 1.94
  3%|▎ | 104/3507 [04:34<1:07:22, 1.19s/it] {'loss': 1.4189, 'learning_rate': 1.9622641509433963e-05, 'epoch': 0.03}
[2025-11-06 17:49:21,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.50 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
[2025-11-06 17:49:22,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:49:22,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.08 | bwd_microstep: 1116.59 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1115.47 | step_microstep: 1.96
[2025-11-06 17:49:22,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.60 | bwd: 1117.35 | bwd_inner: 1.72 | bwd_allreduce: 1115.51 | step: 2.04
  3%|▎ | 105/3507 [04:36<1:13:22, 1.29s/it] {'loss': 1.4189, 'learning_rate': 1.9811320754716984e-05, 'epoch': 0.03}
[2025-11-06 17:49:22,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 299.38 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
[2025-11-06 17:49:23,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 17:49:23,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.13 | bwd_microstep: 228.78 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 227.75 | step_microstep: 1.71
[2025-11-06 17:49:23,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 469.54 | bwd: 229.76 | bwd_inner: 1.84 | bwd_allreduce: 227.80 | step: 1.80
  3%|▎ | 106/3507 [04:37<1:03:54, 1.13s/it] {'loss': 1.3906, 'learning_rate': 2e-05, 'epoch': 0.03}
[2025-11-06 17:49:23,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.64 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:49:24,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.16 | optimizer_step: 0.20
[2025-11-06 17:49:24,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.92 | bwd_microstep: 374.32 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 373.18 | step_microstep: 1.98
[2025-11-06 17:49:24,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.59 | bwd: 375.19 | bwd_inner: 1.83 | bwd_allreduce: 373.22 | step: 2.06
  3%|▎ | 107/3507 [04:37<57:49, 1.02s/it] {'loss': 1.3447, 'learning_rate': 1.9999995733650257e-05, 'epoch': 0.03}
[2025-11-06 17:49:24,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.32 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 17:49:25,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 17:49:25,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.98 | bwd_microstep: 1.99 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 1.05 | step_microstep: 1.84
[2025-11-06 17:49:25,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.13 | bwd: 2.86 | bwd_inner: 1.63 | bwd_allreduce: 1.09 | step: 1.93
  3%|▎ | 108/3507 [04:39<1:12:38, 1.28s/it] {'loss': 1.3994, 'learning_rate': 1.9999982934604664e-05, 'epoch': 0.03}
[2025-11-06 17:49:26,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.40 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
[2025-11-06 17:49:27,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:49:27,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.07 | bwd_microstep: 1.88 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.85 | step_microstep: 1.99
[2025-11-06 17:49:27,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.48 | bwd: 2.81 | bwd_inner: 1.81 | bwd_allreduce: 0.89 | step: 2.08
  3%|▎ | 109/3507 [04:40<1:09:03, 1.22s/it] {'loss': 1.4258, 'learning_rate': 1.999996160287414e-05, 'epoch': 0.03}
[2025-11-06 17:49:27,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.02 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep:
0.03 | step_microstep: 0.08 tensor([[-1.1797, -1.2031, -0.7656, -0.6016, -1.1641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.6016, -0.5977, -0.1104, -0.1523, -0.5898]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.4141, -1.4219, -0.8359, -0.7930, -1.3828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.0488, -0.0209, 0.3828, 0.2471, -0.0444]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.3672, -1.3906, -0.7734, -0.7500, -1.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.3164, -0.2949, 0.1650, 0.0518, -0.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:49:28,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 17:49:28,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.88 | bwd_microstep: 1.56 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 1.81 [2025-11-06 17:49:28,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.92 | bwd: 2.51 | bwd_inner: 1.60 | bwd_allreduce: 0.79 | step: 1.90 3%|▎ | 110/3507 [04:42<1:20:49, 1.43s/it] {'loss': 1.3535, 'learning_rate': 1.999993173847689e-05, 'epoch': 0.03} 3%|▎ | 110/3507 [04:42<1:20:49, 1.43s/it]tensor([[-0.7188, -0.7109, -0.1826, -0.2559, -0.6992]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[0.2354, 0.2480, 0.4492, 0.5703, 0.2051]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.1094, -1.1328, -0.6953, -0.4824, -1.1016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.6836, 0.7227, 0.9492, 0.8633, 0.6602]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([1], device='cuda:0') [2025-11-06 17:49:29,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.98 | bwd_microstep: 0.64 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-0.9375, -0.9336, -0.3691, -0.4531, -0.9102]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.4648, -0.4688, -0.0859, -0.0123, -0.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.5547, -0.5586, -0.1182, -0.0728, -0.5547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.1875, -1.2031, -0.6172, -0.5703, -1.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:49:29,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:49:29,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.99 | bwd_microstep: 33.11 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 32.28 | step_microstep: 1.53 [2025-11-06 17:49:29,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.99 | bwd: 33.75 | bwd_inner: 1.31 | bwd_allreduce: 32.32 | step: 1.60 3%|▎ | 111/3507 [04:43<1:03:54, 1.13s/it] {'loss': 1.3545, 'learning_rate': 1.9999893341438394e-05, 'epoch': 0.03} 3%|▎ | 111/3507 [04:43<1:03:54, 1.13s/it]tensor([[-0.6914, -0.6992, -0.3438, -0.1875, -0.6914]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.1016, -1.1250, -0.6602, -0.5078, -1.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:49:29,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.97 | bwd_microstep: 0.58 | bwd_inner_microstep: 0.49 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.9375, -0.9492, -0.4688, -0.3711, -0.9258]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.6094, -0.6133, -0.2422, -0.0693, -0.6172]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.2295, -0.2295, 0.0776, 0.2100, -0.2480]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.1641, -1.1875, -0.7383, -0.5508, -1.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7344, -1.7734, -1.2266, -0.9844, -1.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.3984, -0.3984, -0.0025, 0.0208, -0.3984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:49:33,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 17:49:33,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.67 | bwd_microstep: 2264.49 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 2263.43 | step_microstep: 1.96 [2025-11-06 17:49:33,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.66 | bwd: 2265.07 | bwd_inner: 1.48 | bwd_allreduce: 2263.47 | step: 2.03 3%|▎ | 112/3507 [04:47<1:50:13, 1.95s/it] {'loss': 1.2773, 'learning_rate': 1.999984641179142e-05, 'epoch': 0.03} 3%|▎ | 112/3507 [04:47<1:50:13, 1.95s/it]tensor([[-0.6797, -0.6875, -0.2432, -0.1162, -0.6836]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:49:33,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.52 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[0.1396, 0.1631, 0.4727, 0.4824, 0.1230]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.0510, 0.0674, 0.3457, 0.3867, 0.0354]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:1') tensor([[0.1904, 0.2080, 0.4551, 0.5078, 0.1699]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.0352, 0.0596, 0.4180, 0.3633, 0.0276]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.5234, -1.5469, -0.9844, -0.8633, -1.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-1.3047, -1.3203, -0.8242, -0.6406, -1.2891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2812, -2.3281, -1.5156, -1.4219, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:49:33,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:49:33,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.93 | bwd_microstep: 181.24 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 180.19 | step_microstep: 1.57 [2025-11-06 17:49:33,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.47 | bwd: 182.26 | bwd_inner: 1.91 | bwd_allreduce: 180.23 | step: 1.65 3%|▎ | 113/3507 [04:47<1:26:01, 1.52s/it] {'loss': 1.4189, 'learning_rate': 1.9999790949576007e-05, 'epoch': 0.03} 3%|▎ | 113/3507 [04:47<1:26:01, 1.52s/it]tensor([[-0.9570, -0.9570, -0.3730, -0.4004, -0.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.2178, 0.2617, 0.6562, 0.4688, 0.2197]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.6250, -1.6484, -1.0391, -0.8672, -1.6016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0156, -2.0625, -1.4219, -1.1797, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:49:33,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 201.31 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[0.4277, 0.4688, 0.7773, 0.7188, 0.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.0366, 0.0508, 0.3008, 0.4238, 0.0140]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.3984, -1.4141, -0.8047, -0.7461, -1.3672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.4551, 0.4883, 0.7539, 0.7734, 0.4277]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:49:35,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:49:35,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.01 | bwd_microstep: 889.68 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 888.56 | step_microstep: 1.69 [2025-11-06 17:49:35,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.35 | bwd: 890.38 | bwd_inner: 1.66 | bwd_allreduce: 888.60 | step: 1.76 3%|▎ | 114/3507 [04:48<1:22:21, 1.46s/it] {'loss': 1.3027, 'learning_rate': 1.9999726954839478e-05, 'epoch': 0.03} 3%|▎ | 114/3507 [04:48<1:22:21, 1.46s/it]tensor([[-0.9375, -0.9414, -0.4375, -0.3047, -0.9336]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.4609, -0.4375, 0.0131, -0.0801, -0.4434]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.5742, -0.5664, -0.1270, -0.0630, -0.5742]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:49:35,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.16 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-0.2988, -0.2793, 0.1826, 0.1289, -0.2969]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.4941, -0.4766, 0.0315, -0.0728, -0.4805]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.3281, -2.3594, -1.5312, -1.4141, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.8672, -0.8438, -0.2734, -0.3730, -0.8359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-0.7305, -0.7266, -0.2090, -0.1904, -0.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:49:35,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 17:49:35,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.04 | bwd_microstep: 68.39 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 67.29 | step_microstep: 2.07 [2025-11-06 17:49:35,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 485.24 | bwd: 69.32 | bwd_inner: 1.87 | bwd_allreduce: 67.33 | step: 2.15 3%|▎ | 115/3507 [04:49<1:07:43, 1.20s/it] {'loss': 1.3779, 'learning_rate': 1.999965442763644e-05, 'epoch': 0.03} 3%|▎ | 115/3507 [04:49<1:07:43, 1.20s/it]tensor([[-0.5195, -0.4863, 0.0525, -0.0427, -0.5078]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.3906, -2.4375, -1.7109, -1.4297, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:49:35,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.69 | bwd_microstep: 2.24 | bwd_inner_microstep: 2.09 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12 tensor([[-1.2422, -1.2500, -0.6719, -0.6094, -1.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.7070, -0.6875, -0.1084, -0.1797, -0.6914]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:3') tensor([[0.0908, 0.1064, 0.3887, 0.5078, 0.0645]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.3594, 0.3887, 0.6836, 0.6875, 0.3340]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.0938, -2.1250, -1.4922, -1.2031, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.2344, -1.2422, -0.5898, -0.5625, -1.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:49:37,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 17:49:37,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.91 | bwd_microstep: 1319.92 | bwd_inner_microstep: 1.37 | bwd_allreduce_microstep: 1318.46 | step_microstep: 1.73 [2025-11-06 17:49:37,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.63 | bwd: 1322.16 | bwd_inner: 3.48 | bwd_allreduce: 1318.53 | step: 1.85 3%|▎ | 116/3507 [04:51<1:16:44, 1.36s/it] {'loss': 1.2617, 'learning_rate': 1.9999573368028785e-05, 'epoch': 0.03} 3%|▎ | 116/3507 [04:51<1:16:44, 1.36s/it]tensor([[-1.2344, -1.2422, -0.6406, -0.5234, -1.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.0547, -1.0625, -0.5273, -0.3496, -1.0547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.6562, -0.6367, -0.0737, -0.1309, -0.6445]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:49:37,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.18 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-1.1797, -1.1797, -0.5508, -0.4980, -1.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-1.1797, -1.1797, -0.4941, -0.5547, -1.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[0.4648, 0.4941, 0.7773, 0.8008, 0.4336]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.2891, -1.2812, -0.6016, -0.6562, -1.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.7539, -0.7305, -0.1260, -0.2412, -0.7305]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:49:37,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.13 | optimizer_step: 0.17 [2025-11-06 17:49:37,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.44 | bwd_microstep: 16.32 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 15.15 | step_microstep: 1.47 [2025-11-06 17:49:37,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 477.65 | bwd: 17.36 | bwd_inner: 2.03 | bwd_allreduce: 15.19 | step: 1.57 3%|▎ | 117/3507 [04:51<1:02:47, 1.11s/it] {'loss': 1.2705, 'learning_rate': 1.9999483776085665e-05, 'epoch': 0.03} 3%|▎ | 117/3507 [04:51<1:02:47, 1.11s/it]tensor([[-0.4277, -0.4199, -0.0586, 0.1128, -0.4395]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.0427, -0.0232, 0.3379, 0.4121, -0.0635]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:49:38,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.33 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.1797, -1.1875, -0.5586, -0.4883, -1.1641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.5547, -1.5703, -0.8672, -0.7773, -1.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.7383, -0.7188, -0.0903, -0.1875, 
-0.7227]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.8555, -0.8555, -0.4297, -0.2002, -0.8555]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.6406, -1.6719, -1.0312, -0.8008, -1.6328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.5781, -0.5508, 0.0352, -0.0669, -0.5586]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:49:43,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.21 [2025-11-06 17:49:43,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.40 | bwd_microstep: 4946.68 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 4945.46 | step_microstep: 2.01 [2025-11-06 17:49:43,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.77 | bwd: 4947.55 | bwd_inner: 1.93 | bwd_allreduce: 4945.50 | step: 2.08 3%|▎ | 118/3507 [04:57<2:13:40, 2.37s/it] {'loss': 1.2383, 'learning_rate': 1.999938565188354e-05, 'epoch': 0.03} 3%|▎ | 118/3507 [04:57<2:13:40, 2.37s/it]tensor([[-1.3984, -1.3906, -0.6484, -0.6719, -1.3672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:49:43,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 87.88 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.5898, -0.5742, -0.0659, -0.0291, -0.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.0840, -0.0552, 0.3379, 0.3438, -0.0942]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.4590, -0.4473, -0.0967, 0.0718, -0.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.3047, -1.3125, -0.7109, -0.6094, -1.2812]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.0334, 0.0591, 0.3887, 0.4492, 0.0130]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.7578, -0.7422, -0.2021, -0.2100, -0.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.9609, -0.9531, -0.3281, -0.3340, -0.9453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:49:43,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 17:49:43,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.23 | bwd_microstep: 361.17 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 360.36 | step_microstep: 1.42 [2025-11-06 17:49:43,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 278.12 | bwd: 362.00 | bwd_inner: 1.48 | bwd_allreduce: 360.40 | step: 1.49 3%|▎ | 119/3507 [04:57<1:44:51, 1.86s/it] {'loss': 1.3164, 'learning_rate': 1.9999278995506124e-05, 'epoch': 0.03} 3%|▎ | 119/3507 [04:57<1:44:51, 1.86s/it]tensor([[-0.6133, -0.6094, -0.1768, -0.0115, -0.6211]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.4102, 0.4434, 0.5898, 0.6914, 0.3789]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.9570, -0.9531, -0.3965, -0.3203, -0.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.1953, -0.1650, 0.2812, 0.1982, -0.1904]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.7812, -0.7617, -0.1416, -0.2412, -0.7617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:49:44,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.23 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.02 | 
step_microstep: 0.07 tensor([[-1.8594, -1.8594, -1.0859, -1.0859, -1.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.7969, -1.8203, -1.1484, -0.9219, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.6562, -1.6719, -0.9844, -0.7734, -1.6328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:49:44,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:49:44,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 108.03 | bwd_microstep: 1.76 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.74 | step_microstep: 1.88 [2025-11-06 17:49:44,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.27 | bwd: 2.66 | bwd_inner: 1.78 | bwd_allreduce: 0.77 | step: 1.95 3%|▎ | 120/3507 [04:58<1:20:42, 1.43s/it] {'loss': 1.2695, 'learning_rate': 1.999916380704443e-05, 'epoch': 0.03} 3%|▎ | 120/3507 [04:58<1:20:42, 1.43s/it]tensor([[0.0488, 0.0713, 0.3438, 0.4746, 0.0223]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.9375, -0.9414, -0.4902, -0.2354, -0.9414]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.2969, -1.2969, -0.6445, -0.5508, -1.2734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.2949, -0.2734, 0.2090, 0.2227, -0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:49:44,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.36 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.7656, -0.7461, -0.1348, -0.1514, -0.7539]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.2422, -1.2344, 
-0.5195, -0.5039, -1.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.2178, 0.2734, 0.6328, 0.4844, 0.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.4062, -1.4219, -0.8711, -0.6250, -1.3984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:49:44,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 17:49:44,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.84 | bwd_microstep: 24.76 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 23.67 | step_microstep: 2.18 [2025-11-06 17:49:44,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.23 | bwd: 25.65 | bwd_inner: 1.83 | bwd_allreduce: 23.70 | step: 2.25 3%|▎ | 121/3507 [04:58<1:04:06, 1.14s/it] {'loss': 1.2646, 'learning_rate': 1.9999040086596748e-05, 'epoch': 0.03} 3%|▎ | 121/3507 [04:58<1:04:06, 1.14s/it]tensor([[-1.4375, -1.4219, -0.6562, -0.7070, -1.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:49:44,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.25 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-1.3281, -1.3438, -0.8086, -0.5664, -1.3203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.7812, -0.7695, -0.1777, -0.1279, -0.7773]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1719, -2.1719, -1.2734, -1.2422, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.6289, -0.5938, -0.0087, -0.0757, -0.6133]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.7812, -0.7539, -0.1855, -0.1914, -0.7695]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.3633, 0.4102, 0.7188, 0.6602, 0.3457]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[1.2812, 1.3516, 1.5000, 1.3906, 1.2422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:49:46,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 17:49:46,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.91 | bwd_microstep: 1265.42 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1264.40 | step_microstep: 1.71 [2025-11-06 17:49:46,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.19 | bwd: 1266.27 | bwd_inner: 1.70 | bwd_allreduce: 1264.44 | step: 1.79 3%|▎ | 122/3507 [05:00<1:12:30, 1.29s/it] {'loss': 1.291, 'learning_rate': 1.999890783426864e-05, 'epoch': 0.03} 3%|▎ | 122/3507 [05:00<1:12:30, 1.29s/it]tensor([[0.1079, 0.1641, 0.6055, 0.4590, 0.1079]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[ 0.0078, 0.0410, 0.3750, 0.3711, -0.0033]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.4004, -0.3750, 0.1846, 0.1396, -0.3984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.5703, -0.5430, -0.0151, -0.0786, -0.5586]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:49:47,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.19 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-0.1387, -0.1172, 0.2539, 0.3574, -0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[ 0.0035, 0.0283, 0.3359, 0.4785, -0.0219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
[Training log, steps 123–144 of 3507 (epoch 0.04), cleaned for readability. The raw capture interleaved per-rank debug prints — 1×5 bfloat16 score tensors paired with integer label tensors on cuda:0–cuda:3, their grad_fn names stripped by the capture — into every record, often splitting a tensor repr across lines. Those prints are collapsed here; the trainer metrics and the DeepSpeed Rank-0 wall-clock timers (ms) are kept.]

step 123  1.96 s/it  loss 1.3682  lr 1.9998767050172955e-05  | fwd 305.69, bwd 2584.83 (allreduce 2582.68), step 2.60
step 124  1.53 s/it  loss 1.2832  lr 1.9998617734429815e-05  | fwd 375.16, bwd 114.25 (allreduce 112.33), step 1.64
step 125  1.83 s/it  loss 1.3867  lr 1.9998459887166635e-05  | fwd 314.29, bwd 2190.31 (allreduce 2188.33), step 2.05
step 126  1.41 s/it  loss 1.418   lr 1.9998293508518096e-05  | fwd 361.70, bwd 36.04 (allreduce 34.16), step 1.52
step 127  1.23 s/it  loss 1.2188  lr 1.9998118598626163e-05  | fwd 401.69, bwd 371.70 (allreduce 369.79), step 2.15
step 128  1.02 it/s  loss 1.2393  lr 1.9997935157640085e-05  | fwd 344.39, bwd 4.99 (allreduce 2.97), step 1.43
step 129  1.32 s/it  loss 1.2471  lr 1.9997743185716386e-05  | fwd 306.49, bwd 1334.75 (allreduce 1332.73), step 1.71
step 130  1.24 s/it  loss 1.3359  lr 1.999754268301887e-05   | fwd 299.61, bwd 655.93 (allreduce 653.89), step 1.72
step 131  1.16 s/it  loss 1.4004  lr 1.9997333649718614e-05  | fwd 349.60, bwd 158.16 (allreduce 156.34), step 1.46
step 132  1.33 s/it  loss 1.3096  lr 1.9997116085993986e-05  | fwd 266.19, bwd 1321.42 (allreduce 1319.42), step 1.66
step 133  1.07 s/it  loss 1.2178  lr 1.9996889992030627e-05  | fwd 331.51, bwd 108.55 (allreduce 106.50), step 1.73
step 134  1.46 s/it  loss 1.2803  lr 1.9996655368021455e-05  | fwd 409.94, bwd 1317.87 (allreduce 1315.87), step 2.26
step 135  1.16 s/it  loss 1.3594  lr 1.9996412214166667e-05  | fwd 392.62, bwd 36.10 (allreduce 34.24), step 3.13
step 136  1.55 s/it  loss 1.1963  lr 1.9996160530673735e-05  | fwd 318.00, bwd 1215.13 (allreduce 1212.96), step 4.97
step 137  1.29 s/it  loss 1.1768  lr 1.9995900317757423e-05  | fwd 577.37, bwd 61.47 (allreduce 59.65), step 2.45
step 138  1.20 s/it  loss 1.165   lr 1.9995631575639752e-05  | fwd 259.37, bwd 7.81 (allreduce 0.93), step 2.25
step 139  1.05 s/it  loss 1.1377  lr 1.9995354304550038e-05  | fwd 347.24, bwd 306.18 (allreduce 304.51), step 2.24
step 140  1.57 s/it  loss 1.0972  lr 1.9995068504724863e-05  | fwd 344.96, bwd 2.88 (allreduce 0.85), step 2.97
step 141  1.24 s/it  loss 1.3047  lr 1.99947741764081e-05    | fwd 388.99, bwd 46.63 (allreduce 44.85), step 1.86
step 142  1.76 s/it  loss 1.2139  lr 1.999447131985088e-05   | fwd 517.23, bwd 3.12 (allreduce 0.99), step 194.72
step 143  1.35 s/it  loss 1.3682  lr 1.9994159935311633e-05  | fwd 356.56, bwd 2.57 (allreduce 0.74), step 1.32
step 144  1.85 s/it  (loss/lr record truncated at end of capture)           | fwd 565.80, bwd 2.75 (allreduce 0.79), step 2.27
{'loss': 1.1216, 'learning_rate': 1.9993840023056045e-05, 'epoch': 0.04} 4%|▍ | 144/3507 [05:32<1:43:36, 1.85s/it]tensor([[0.9375, 1.0078, 1.2109, 1.3828, 0.8828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.7578, -0.7148, 0.0776, 0.1318, -0.7422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.1729, -0.1299, 0.5234, 0.6211, -0.1885]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:50:18,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.66 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06 tensor([[-0.6836, -0.6562, 0.0315, 0.1426, -0.6758]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.9805, -0.9219, -0.0082, -0.1494, -0.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.6797, -0.6289, 0.1240, 0.0854, -0.6523]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2344, -3.2500, -1.8516, -1.6953, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.1484, -1.1250, -0.2109, -0.1079, -1.1172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:50:18,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 17:50:18,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.04 | bwd_microstep: 134.29 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 133.29 | step_microstep: 1.47 [2025-11-06 17:50:18,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 390.73 | bwd: 135.13 | bwd_inner: 1.69 | bwd_allreduce: 133.32 | step: 1.54 4%|▍ | 145/3507 [05:32<1:21:55, 1.46s/it] {'loss': 1.1797, 'learning_rate': 
1.9993511583357087e-05, 'epoch': 0.04} 4%|▍ | 145/3507 [05:32<1:21:55, 1.46s/it]tensor([[-0.2734, -0.1807, 0.5586, 0.3438, -0.2402]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.5703, -0.5234, 0.2539, 0.3145, -0.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.2422, -1.2266, -0.3359, -0.1377, -1.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:18,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.81 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.7500, -0.6875, 0.2246, 0.0640, -0.7148]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.6172, -1.6172, -0.7344, -0.5039, -1.5703]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.4297, -1.4062, -0.4141, -0.3066, -1.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.7656, -0.7148, 0.2070, 0.1426, -0.7383]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.5391, -1.5391, -0.6406, -0.3945, -1.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:20,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.12 | optimizer_step: 0.15 [2025-11-06 17:50:20,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.08 | bwd_microstep: 2.13 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 0.79 | step_microstep: 1.68 [2025-11-06 17:50:20,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.91 | bwd: 3.04 | bwd_inner: 2.11 | bwd_allreduce: 0.81 | step: 1.75 4%|▍ | 146/3507 [05:34<1:21:11, 1.45s/it] {'loss': 1.0957, 'learning_rate': 1.9993174616495013e-05, 'epoch': 0.04} 4%|▍ | 
146/3507 [05:34<1:21:11, 1.45s/it]tensor([[-1.3125, -1.2891, -0.3867, -0.2197, -1.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.9023, -0.8633, 0.0569, 0.0293, -0.8711]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.9336, -0.9062, -0.1309, -0.0024, -0.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:20,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.03 | bwd_microstep: 1.11 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.7344, -0.6602, 0.1982, 0.0640, -0.6914]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-0.9375, -0.8906, 0.0811, 0.0194, -0.9023]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.7891, 0.8711, 1.2188, 1.1484, 0.7578]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.2656, -1.2500, -0.3867, -0.2656, -1.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.8672, -0.8203, 0.1079, 0.0732, -0.8359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:50:20,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:50:20,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.50 | bwd_microstep: 88.55 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 87.28 | step_microstep: 1.38 [2025-11-06 17:50:20,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.56 | bwd: 89.65 | bwd_inner: 2.21 | bwd_allreduce: 87.31 | step: 1.45 4%|▍ | 147/3507 [05:34<1:05:25, 1.17s/it] {'loss': 1.2637, 'learning_rate': 1.9992829122757343e-05, 'epoch': 0.04} 4%|▍ | 147/3507 [05:34<1:05:25, 1.17s/it]tensor([[-1.2031, 
-1.1406, -0.0967, -0.2539, -1.1484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.9297, -1.9453, -1.0703, -0.6680, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)[2025-11-06 17:50:20,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.78 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([4], device='cuda:3') tensor([[-1.1172, -1.0938, -0.1514, -0.0664, -1.0859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.1562, -2.1250, -0.9102, -0.8828, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-0.5859, -0.5312, 0.2480, 0.1279, -0.5547]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9062, -4.9375, -3.0469, -2.7031, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.0219, 0.0427, 0.7500, 0.6562, -0.0203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.2930, -0.2266, 0.5000, 0.5000, -0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:50:22,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 17:50:22,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.80 | bwd_microstep: 2.00 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.85 | step_microstep: 1.95 [2025-11-06 17:50:22,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.61 | bwd: 2.99 | bwd_inner: 1.98 | bwd_allreduce: 0.89 | step: 2.05 4%|▍ | 148/3507 [05:36<1:22:58, 1.48s/it] {'loss': 1.3638, 'learning_rate': 1.9992475102438878e-05, 'epoch': 0.04} 4%|▍ | 148/3507 [05:36<1:22:58, 1.48s/it]tensor([[-1.2578, -1.2422, -0.4102, -0.1992, -1.2266]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:23,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.63 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-2.0156, -2.0156, -1.0234, -0.7148, -1.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.8477, -0.8125, -0.0403, 0.0830, -0.8242]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.3828, -1.3672, -0.4082, -0.2559, -1.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.8438, 0.9180, 1.2969, 1.3047, 0.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.4375, -1.3906, -0.2637, -0.3164, -1.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.6328, -0.6055, 0.0630, 0.3027, -0.6328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.0962, -0.0518, 0.5000, 0.6367, -0.1162]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:50:23,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 17:50:23,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.18 | bwd_microstep: 70.39 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 69.40 | step_microstep: 2.18 [2025-11-06 17:50:23,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.81 | bwd: 71.28 | bwd_inner: 1.73 | bwd_allreduce: 69.43 | step: 2.25 4%|▍ | 149/3507 [05:37<1:05:37, 1.17s/it] {'loss': 1.1318, 'learning_rate': 1.999211255584169e-05, 'epoch': 0.04} 4%|▍ | 149/3507 [05:37<1:05:37, 1.17s/it]tensor([[-0.6875, -0.6094, 0.2090, 0.1123, -0.6445]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:1') tensor([[-0.3984, -0.3652, 0.2363, 0.5195, -0.4102]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.3340, -0.2969, 0.3027, 0.5195, -0.3418]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:23,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.16 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.0000, -1.9844, -0.8047, -0.6914, -1.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.3125, -0.2598, 0.4570, 0.5664, -0.3164]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.7695, -0.7305, -0.0435, 0.0211, -0.7383]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.3691, -0.3125, 0.3672, 0.4082, -0.3613]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.6289, -0.5859, 0.1562, 0.2227, -0.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:25,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.13 | optimizer_step: 0.18 [2025-11-06 17:50:25,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.71 | bwd_microstep: 1.84 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.01 [2025-11-06 17:50:25,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.89 | bwd: 2.78 | bwd_inner: 1.82 | bwd_allreduce: 0.83 | step: 2.09 4%|▍ | 150/3507 [05:38<1:14:24, 1.33s/it] {'loss': 1.1816, 'learning_rate': 1.9991741483275132e-05, 'epoch': 0.04} 4%|▍ | 150/3507 [05:38<1:14:24, 1.33s/it]tensor([[-1.3359, -1.2969, -0.2441, -0.1709, -1.2891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.7070, 0.7852, 
1.2188, 1.2500, 0.6680]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.7852, -0.7109, 0.2266, 0.0229, -0.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-1.9531, -1.9375, -0.8281, -0.6836, -1.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:25,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.60 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.0938, -2.0938, -1.0078, -0.7930, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.1162, -0.0640, 0.6016, 0.6836, -0.1289]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.4512, 0.5664, 1.1875, 0.9570, 0.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.8008, -0.7383, 0.2373, 0.1797, -0.7617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:25,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 17:50:25,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.35 | bwd_microstep: 1.70 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.76 | step_microstep: 1.48 [2025-11-06 17:50:25,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 396.97 | bwd: 2.54 | bwd_inner: 1.63 | bwd_allreduce: 0.78 | step: 1.55 4%|▍ | 151/3507 [05:39<59:21, 1.06s/it] {'loss': 1.2407, 'learning_rate': 1.999136188505583e-05, 'epoch': 0.04} 4%|▍ | 151/3507 [05:39<59:21, 1.06s/it]tensor([[-0.4668, -0.4238, 0.2490, 0.4023, -0.4648]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8516, -1.8281, -0.7266, -0.5586, -1.7891]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:50:25,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.70 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-1.2969, -1.2500, -0.1738, -0.2432, -1.2422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.5820, -0.5352, 0.2090, 0.2139, -0.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[ 0.0170, 0.0674, 0.6328, 0.8281, -0.0073]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.9688, -1.9609, -0.8789, -0.6094, -1.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.3281, -1.2578, -0.1787, -0.2949, -1.2578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.7930, -0.7227, 0.2041, 0.0544, -0.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:50:27,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 17:50:27,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.41 | bwd_microstep: 2.10 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.12 [2025-11-06 17:50:27,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.13 | bwd: 3.03 | bwd_inner: 2.02 | bwd_allreduce: 0.86 | step: 2.21 4%|▍ | 152/3507 [05:41<1:18:51, 1.41s/it] {'loss': 1.1064, 'learning_rate': 1.999097376150768e-05, 'epoch': 0.04} 4%|▍ | 152/3507 [05:41<1:18:51, 1.41s/it]tensor([[-1.0703, -1.0156, -0.0091, -0.0137, -1.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[1.6250, 1.7109, 1.7891, 1.8516, 1.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:1') tensor([[0.3027, 0.4062, 1.1250, 1.0469, 0.2910]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-0.1826, -0.1245, 0.5273, 0.6094, -0.1836]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:27,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.06 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-0.3945, -0.3477, 0.2578, 0.4570, -0.3887]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.5781, -1.5234, -0.3945, -0.4961, -1.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.0547, -0.9883, 0.0437, -0.0510, -1.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.0859, -1.0312, 0.0195, -0.0126, -1.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:50:28,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.14 | optimizer_step: 0.25 [2025-11-06 17:50:28,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.31 | bwd_microstep: 1.89 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 1.48 [2025-11-06 17:50:28,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.41 | bwd: 2.70 | bwd_inner: 1.80 | bwd_allreduce: 0.78 | step: 1.56 4%|▍ | 153/3507 [05:41<1:00:58, 1.09s/it] {'loss': 1.2588, 'learning_rate': 1.999057711296186e-05, 'epoch': 0.04} 4%|▍ | 153/3507 [05:41<1:00:58, 1.09s/it]tensor([[0.2812, 0.3477, 0.9453, 1.0547, 0.2539]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.5781, -1.5312, -0.4004, -0.4004, -1.5078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:50:28,260] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.08 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-1.5078, -1.4844, -0.5156, -0.2832, -1.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.1201, -0.0703, 0.5625, 0.7383, -0.1377]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.3203, -1.2656, -0.1562, -0.2324, -1.2578]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[1.5234, 1.6250, 1.7891, 1.7578, 1.4609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2344, -2.2188, -1.0703, -0.8633, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.0781, -2.0469, -0.7422, -0.6758, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:30,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 17:50:30,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.05 | bwd_microstep: 2.15 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1.03 | step_microstep: 1.77 [2025-11-06 17:50:30,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.10 | bwd: 3.09 | bwd_inner: 1.84 | bwd_allreduce: 1.08 | step: 1.86 4%|▍ | 154/3507 [05:44<1:19:09, 1.42s/it] {'loss': 1.1152, 'learning_rate': 1.9990171939756815e-05, 'epoch': 0.04} 4%|▍ | 154/3507 [05:44<1:19:09, 1.42s/it]tensor([[-1.5234, -1.4922, -0.6172, -0.2520, -1.4609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.8477, -0.8164, -0.0615, 0.2461, -0.8320]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.4121, -0.3613, 0.3652, 0.5547, -0.4102]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:30,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.90 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.7422, -0.6797, 0.2305, 0.2051, -0.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.5391, -1.4844, -0.2354, -0.2402, -1.4766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.0000, -0.9297, 0.0894, -0.0854, -0.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3594, -3.3438, -1.8359, -1.6406, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.2119, -0.1099, 0.7852, 0.5508, -0.1865]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 17:50:30,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 17:50:30,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.57 | bwd_microstep: 113.31 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 112.25 | step_microstep: 1.73 [2025-11-06 17:50:30,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.48 | bwd: 114.08 | bwd_inner: 1.66 | bwd_allreduce: 112.29 | step: 1.80 4%|▍ | 155/3507 [05:44<1:03:30, 1.14s/it] {'loss': 1.1724, 'learning_rate': 1.9989758242238268e-05, 'epoch': 0.04} 4%|▍ | 155/3507 [05:44<1:03:30, 1.14s/it]tensor([[-0.6797, -0.6445, 0.0381, 0.3027, -0.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:30,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 102.65 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-1.6328, -1.6250, -0.6602, 
-0.3223, -1.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.4531, -0.4043, 0.3047, 0.3945, -0.4473]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5000, -2.5000, -1.2031, -0.9453, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.8789, -0.8203, 0.1826, 0.1807, -0.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.6094, -1.5625, -0.3320, -0.3125, -1.5391]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.4277, 0.5273, 1.1406, 1.0625, 0.4082]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.1641, -1.1094, -0.1089, -0.0071, -1.1172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:50:33,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 17:50:33,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.57 | bwd_microstep: 1.71 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.76 | step_microstep: 1.78 [2025-11-06 17:50:33,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 258.22 | bwd: 2.55 | bwd_inner: 1.60 | bwd_allreduce: 0.80 | step: 1.87 4%|▍ | 156/3507 [05:47<1:33:35, 1.68s/it] {'loss': 1.0874, 'learning_rate': 1.998933602075922e-05, 'epoch': 0.04} 4%|▍ | 156/3507 [05:47<1:33:35, 1.68s/it]tensor([[-0.4883, -0.4375, 0.2295, 0.4297, -0.4785]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0312, -1.9688, -0.7383, -0.7148, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-0.4727, -0.4277, 0.2041, 0.3789, -0.4648]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.1797, -1.1094, 
0.0227, -0.0767, -1.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:50:34,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 334.98 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.08 tensor([[-0.3691, -0.3047, 0.4785, 0.5117, -0.3613]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.6953, -1.6797, -0.6367, -0.4004, -1.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.9336, -0.8984, -0.0840, 0.1279, -0.9141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.1250, -2.0938, -0.8008, -0.6875, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:50:34,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:50:34,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.63 | bwd_microstep: 1.70 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 1.91 [2025-11-06 17:50:34,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 458.64 | bwd: 2.71 | bwd_inner: 1.77 | bwd_allreduce: 0.81 | step: 2.00 4%|▍ | 157/3507 [05:48<1:13:53, 1.32s/it] {'loss': 1.2217, 'learning_rate': 1.998890527567993e-05, 'epoch': 0.04} 4%|▍ | 157/3507 [05:48<1:13:53, 1.32s/it]tensor([[1.8750, 1.9922, 2.0781, 1.9531, 1.8047]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.4531, -1.4219, -0.4199, -0.1260, -1.4141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:50:34,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.24 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 
tensor([[-1.8984, -1.8906, -0.9062, -0.5586, -1.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.7344, -0.6523, 0.3125, 0.1523, -0.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.8203, -0.7539, 0.2412, 0.2598, -0.7852]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[0.0618, 0.1396, 0.8320, 0.8828, 0.0491]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.2188, -2.1875, -0.8203, -0.7578, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.9219, -2.9219, -1.4609, -1.2500, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:36,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.13 | optimizer_step: 0.17
[2025-11-06 17:50:36,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.27 | bwd_microstep: 1.80 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 2.17
[2025-11-06 17:50:36,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.45 | bwd: 2.67 | bwd_inner: 1.78 | bwd_allreduce: 0.77 | step: 2.24
5%|▍ | 158/3507 [05:50<1:35:53, 1.72s/it] {'loss': 1.1167, 'learning_rate': 1.9988466007367944e-05, 'epoch': 0.05}
tensor([[-0.6367, -0.5391, 0.4004, 0.2676, -0.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.7617, -0.7148, 0.0383, 0.2988, -0.7422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.5156, -1.4922, -0.6172, -0.2207, -1.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.2500, -1.1562, 0.0688, -0.0728, -1.1797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.7422, -1.7031, -0.5625, -0.4062, -1.6797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.0781, -1.0547, -0.2637, 0.0786, -1.0547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.4258, -0.3164, 0.6055, 0.4238, -0.3867]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:50:37,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.28 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-0.7891, -0.7461, 0.0488, 0.2559, -0.7773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:37,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 17:50:37,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.10 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.85 | step_microstep: 1.76
[2025-11-06 17:50:37,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 498.41 | bwd: 2.78 | bwd_inner: 1.79 | bwd_allreduce: 0.87 | step: 1.83
5%|▍ | 159/3507 [05:51<1:16:09, 1.36s/it] {'loss': 1.0405, 'learning_rate': 1.9988018216198077e-05, 'epoch': 0.05}
tensor([[-0.3340, -0.2734, 0.4922, 0.5859, -0.3359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.9531, -1.9375, -0.9453, -0.5430, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.2695, -0.2080, 0.4902, 0.6406, -0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:37,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.70 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-1.6562, -1.5703, -0.2373, -0.3594, -1.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.1416, -0.0737, 0.6055, 0.7148, -0.1475]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.6719, -1.6094, -0.3613, -0.3223, -1.6016]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.0312, -2.0156, -0.9141, -0.5742, -1.9609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8438, -2.8438, -1.5703, -1.1641, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:37,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:50:37,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.87 | bwd_microstep: 1.76 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.66 | step_microstep: 1.32
[2025-11-06 17:50:37,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.60 | bwd: 2.77 | bwd_inner: 1.96 | bwd_allreduce: 0.69 | step: 1.40
5%|▍ | 160/3507 [05:51<59:34, 1.07s/it] {'loss': 1.0015, 'learning_rate': 1.998756190255242e-05, 'epoch': 0.05}
tensor([[-1.8984, -1.8594, -0.6289, -0.3672, -1.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[0.4922, 0.6055, 1.3281, 1.2344, 0.4727]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:50:37,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.25 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.6406, -1.6094, -0.5977, -0.3125, -1.5859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.8867, -0.8398, 0.0806, 0.3320, -0.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.2559, -0.1602, 0.6172, 0.5898, -0.2393]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.8711, -0.7773, 0.2354, 0.2168, -0.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-0.8789, -0.8398, -0.0100, 0.2578, -0.8633]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.4062, -2.3594, -0.9922, -0.9297, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:50:38,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 17:50:38,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 103.03 | bwd_microstep: 148.88 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 147.74 | step_microstep: 1.52
[2025-11-06 17:50:38,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 263.30 | bwd: 149.84 | bwd_inner: 1.93 | bwd_allreduce: 147.79 | step: 1.60
5%|▍ | 161/3507 [05:52<49:03, 1.14it/s] {'loss': 1.0439, 'learning_rate': 1.9987097066820324e-05, 'epoch': 0.05}
tensor([[-1.4688, -1.4531, -0.5430, -0.2695, -1.4297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:38,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.96 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.8438, -0.7578, 0.3789, 0.2598, -0.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.6758, -0.6328, 0.1128, 0.3223, -0.6680]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.7695, -0.7109, 0.1709, 0.3164, -0.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.0703, -0.9727, 0.1582, 0.0272, -1.0078]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-0.6797, -0.6289, 0.1582, 0.3789, -0.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[0.8516, 0.9492, 1.3516, 1.4062, 0.8086]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.4609, -0.4043, 0.3594, 0.5469, -0.4609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:50:40,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 17:50:40,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.24 | bwd_microstep: 2.06 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.95 | step_microstep: 2.36
[2025-11-06 17:50:40,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.22 | bwd: 2.99 | bwd_inner: 1.84 | bwd_allreduce: 0.99 | step: 2.44
5%|▍ | 162/3507 [05:54<1:09:52, 1.25s/it] {'loss': 1.2437, 'learning_rate': 1.9986623709398427e-05, 'epoch': 0.05}
tensor([[-1.2891, -1.2344, -0.1396, -0.0164, -1.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.7188, -1.6484, -0.3242, -0.3438, -1.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:50:40,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.65 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-0.0762, 0.0435, 0.9258, 0.7578, -0.0564]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.4219, -1.3984, -0.4824, -0.0967, -1.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[0.2070, 0.2793, 0.8945, 1.0859, 0.1777]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[0.5078, 0.5742, 1.0000, 1.2344, 0.4609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.7852, -0.7266, 0.2012, 0.4219, -0.7734]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.6406, -0.5898, 0.0996, 0.2773, -0.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:41,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.21 | optimizer_step: 0.20
[2025-11-06 17:50:41,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.42 | bwd_microstep: 1.91 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.55
[2025-11-06 17:50:41,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.09 | bwd: 2.69 | bwd_inner: 1.69 | bwd_allreduce: 0.87 | step: 2.63
5%|▍ | 163/3507 [05:55<1:10:58, 1.27s/it] {'loss': 1.0732, 'learning_rate': 1.9986141830690626e-05, 'epoch': 0.05}
tensor([[-1.6797, -1.6406, -0.4160, -0.2217, -1.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.7617, -0.6680, 0.3438, 0.3809, -0.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.8438, -2.8438, -1.5781, -1.1016, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.0625, -1.0234, -0.1787, 0.1299, -1.0391]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:41,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.80 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-1.0547, -0.9531, 0.2178, 0.0330, -0.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.2656, -1.1875, 0.0219, -0.0175, -1.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.1953, -1.1406, -0.1221, 0.0466, -1.1641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.1416, -0.0898, 0.5078, 0.7422, -0.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:42,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 17:50:42,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.02 | bwd_microstep: 2.04 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.92 | step_microstep: 1.84
[2025-11-06 17:50:42,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.84 | bwd: 2.88 | bwd_inner: 1.79 | bwd_allreduce: 0.95 | step: 1.91
5%|▍ | 164/3507 [05:56<1:05:04, 1.17s/it] {'loss': 1.0166, 'learning_rate': 1.9985651431108095e-05, 'epoch': 0.05}
tensor([[-2.4844, -2.4531, -1.1719, -0.8594, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:50:42,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.57 | bwd_microstep: 1.12 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.1719, -2.1406, -1.0312, -0.6328, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.5430, -0.4355, 0.5898, 0.4707, -0.5078]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.6406, -0.5859, 0.2129, 0.5117, -0.6328]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.5078, -1.4609, -0.3008, -0.1201, -1.4609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.4844, -3.4844, -2.0000, -1.5078, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-2.0938, -2.0781, -0.9570, -0.5547, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.6406, -1.5312, -0.1709, -0.3301, -1.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:50:45,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 17:50:45,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 93.91 | bwd_microstep: 2.29 | bwd_inner_microstep: 1.35 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.13
[2025-11-06 17:50:45,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 253.48 | bwd: 3.41 | bwd_inner: 2.39 | bwd_allreduce: 0.88 | step: 2.21
5%|▍ | 165/3507 [05:59<1:29:46, 1.61s/it] {'loss': 1.21, 'learning_rate': 1.9985152511069274e-05, 'epoch': 0.05}
tensor([[-0.4766, -0.3770, 0.6523, 0.5742, -0.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.7188, -1.6328, -0.2344, -0.3320, -1.6328]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.9023, -0.8555, -0.0148, 0.3125, -0.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.7188, -0.6484, 0.3047, 0.4531, -0.7070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:45,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.84 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.4531, -2.4219, -1.1172, -0.7930, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.8867, -0.8320, 0.0562, 0.3223, -0.8672]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.9805, -0.9141, -0.0728, 0.0137, -0.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.6406, -2.6094, -1.2812, -0.9180, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:45,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.15 | optimizer_step: 0.19
[2025-11-06 17:50:45,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.21 | bwd_microstep: 2.52 | bwd_inner_microstep: 1.61 | bwd_allreduce_microstep: 0.83 | step_microstep: 1.92
[2025-11-06 17:50:45,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 404.06 | bwd: 3.52 | bwd_inner: 2.53 | bwd_allreduce: 0.86 | step: 2.00
5%|▍ | 166/3507 [05:59<1:14:54, 1.35s/it] {'loss': 0.9873, 'learning_rate': 1.998464507099988e-05, 'epoch': 0.05}
tensor([[-1.4375, -1.3828, -0.3008, -0.0461, -1.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.5156, -1.4844, -0.5078, -0.1279, -1.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:46,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.29 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-0.5469, -0.4980, 0.2168, 0.5195, -0.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.8906, -0.7617, 0.3945, 0.2207, -0.8320]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.8203, -0.7305, 0.2344, 0.2070, -0.7773]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.8594, -0.7656, 0.2539, 0.1895, -0.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.3516, -1.3125, -0.3125, 0.0178, -1.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[0.7617, 0.8398, 1.2266, 1.3828, 0.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:48,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.12 | optimizer_step: 0.15
[2025-11-06 17:50:48,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.50 | bwd_microstep: 1.98 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.80 | step_microstep: 1.86
[2025-11-06 17:50:48,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.82 | bwd: 3.05 | bwd_inner: 2.10 | bwd_allreduce: 0.84 | step: 1.94
5%|▍ | 167/3507 [06:01<1:28:00, 1.58s/it] {'loss': 1.0283, 'learning_rate': 1.9984129111332896e-05, 'epoch': 0.05}
tensor([[-1.5469, -1.5078, -0.5039, -0.0889, -1.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.1016, -1.0547, -0.0630, 0.1885, -1.0703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:48,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.60 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-1.5781, -1.5156, -0.3340, -0.1465, -1.5234]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.1875, -2.1250, -0.6367, -0.4590, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.0938, -2.0312, -0.6836, -0.5664, -2.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.1562, -1.0469, 0.1611, 0.2012, -1.1016]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[0.6055, 0.6953, 1.2500, 1.3906, 0.5547]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.1172, -1.0703, -0.1807, 0.1177, -1.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:50:48,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 17:50:48,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.90 | bwd_microstep: 87.18 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 85.98 | step_microstep: 1.76
[2025-11-06 17:50:48,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.53 | bwd: 88.12 | bwd_inner: 1.97 | bwd_allreduce: 86.03 | step: 1.83
5%|▍ | 168/3507 [06:02<1:09:42, 1.25s/it] {'loss': 1.0366, 'learning_rate': 1.9983604632508572e-05, 'epoch': 0.05}
tensor([[-1.3125, -1.2656, -0.2598, 0.0928, -1.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:48,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.80 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-1.0000, -0.9531, -0.0664, 0.1699, -0.9766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.1484, -1.0625, 0.0996, 0.1758, -1.1016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[0.5078, 0.6289, 1.3438, 1.3359, 0.4727]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1250, -4.1250, -2.4375, -2.0156, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.4062, -1.3359, -0.1191, 0.0481, -1.3516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.2305, -0.1299, 0.6523, 0.6719, -0.2158]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.5391, -1.5000, -0.5195, -0.0840, -1.4922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:50:50,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:50:50,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.14 | bwd_microstep: 974.04 | bwd_inner_microstep: 1.58 | bwd_allreduce_microstep: 972.35 | step_microstep: 1.63
[2025-11-06 17:50:50,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.94 | bwd: 975.12 | bwd_inner: 2.57 | bwd_allreduce: 972.39 | step: 1.72
5%|▍ | 169/3507 [06:04<1:24:17, 1.52s/it] {'loss': 1.0425, 'learning_rate': 1.9983071634974436e-05, 'epoch': 0.05}
tensor([[-1.6797, -1.6406, -0.5781, -0.1445, -1.6172]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-0.5273, -0.4512, 0.4941, 0.6445, -0.5195]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.0625, -0.9922, 0.0635, 0.2617, -1.0234]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.7812, -1.7109, -0.3926, -0.2422, -1.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:50,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.12 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.5547, -1.5000, -0.3164, -0.0957, -1.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-0.2188, -0.1240, 0.7305, 0.7852, -0.2129]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.5938, -1.5469, -0.3672, -0.1426, -1.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[0.3281, 0.4688, 1.2031, 1.1016, 0.3301]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:50:51,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.14 | optimizer_step: 0.19
[2025-11-06 17:50:51,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.59 | bwd_microstep: 16.44 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 15.38 | step_microstep: 1.52
[2025-11-06 17:50:51,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.75 | bwd: 17.25 | bwd_inner: 1.69 | bwd_allreduce: 15.42 | step: 1.60
5%|▍ | 170/3507 [06:04<1:05:20, 1.17s/it] {'loss': 1.3691, 'learning_rate': 1.9982530119185277e-05, 'epoch': 0.05}
tensor([[-1.0781, -0.9961, 0.0967, 0.2119, -1.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.8711, -0.7773, 0.3926, 0.3242, -0.8242]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.0078, -0.9023, 0.3301, 0.2617, -0.9492]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:50:51,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 286.74 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-1.5625, -1.4688, -0.0874, -0.1504, -1.4766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.8125, -1.7812, -0.6172, -0.2480, -1.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.0000, -2.9531, -1.3047, -1.0078, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.2656, -3.2031, -1.3516, -1.2266, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.3438, -2.3125, -1.1328, -0.6914, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:50:52,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.28
[2025-11-06 17:50:52,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 122.05 | bwd_microstep: 743.75 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 742.55 | step_microstep: 1.97
[2025-11-06 17:50:52,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 408.81 | bwd: 744.74 | bwd_inner: 2.00 | bwd_allreduce: 742.60 | step: 2.06
5%|▍ | 171/3507 [06:06<1:12:38, 1.31s/it] {'loss': 0.9824, 'learning_rate': 1.9981980085603147e-05, 'epoch': 0.05}
tensor([[-0.8125, -0.7461, 0.1270, 0.4102, -0.7852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.8281, -2.8125, -1.3984, -0.9844, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:50:52,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.80 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.0156, -1.9766, -0.6758, -0.3711, -1.9453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.4883, -0.4023, 0.5781, 0.7344, -0.4824]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.0938, -2.0469, -0.6523, -0.3867, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.4219, -1.3594, -0.2373, -0.0172, -1.3672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.4531, -1.3594, -0.0153, -0.0535, -1.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.4297, -1.3906, -0.3340, 0.1011, -1.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 17:50:53,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 17:50:53,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.31 | bwd_microstep: 23.62 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 22.51 | step_microstep: 1.56
[2025-11-06 17:50:53,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.14 | bwd: 24.50 | bwd_inner: 1.82 | bwd_allreduce: 22.55 | step: 1.65
5%|▍ | 172/3507 [06:06<57:28, 1.03s/it] {'loss': 0.9526, 'learning_rate': 1.9981421534697384e-05, 'epoch': 0.05}
tensor([[-1.2422, -1.1250, 0.2168, 0.0981, -1.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:50:53,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.65 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.4219, -1.3750, -0.3418, 0.0552, -1.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.3516, -0.2676, 0.6367, 0.7930, -0.3477]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6250, -2.5938, -1.0938, -0.6797, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.5703, -1.5234, -0.3848, -0.0574, -1.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.7266, -1.6797, -0.5508, -0.2266, -1.6641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.5625, -0.4551, 0.6289, 0.6172, -0.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.7344, -1.6719, -0.3477, 0.0312, -1.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:54,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:50:54,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.43 | bwd_microstep: 1.82 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.83 | step_microstep: 1.67
[2025-11-06 17:50:54,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 279.10 | bwd: 2.70 | bwd_inner: 1.71 | bwd_allreduce: 0.87 | step: 1.76
5%|▍ | 173/3507 [06:08<1:01:24, 1.11s/it] {'loss': 0.9507, 'learning_rate': 1.9980854466944572e-05, 'epoch': 0.05}
tensor([[-2.6719, -2.5938, -0.9023, -0.7969, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:50:54,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.46 | bwd_microstep: 1.14 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-1.3359, -1.2266, 0.1436, 0.0164, -1.2578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.0703, -0.9766, 0.2871, 0.3613, -1.0234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.4141, -1.3281, -0.1216, 0.0111, -1.3359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.7617, -0.6406, 0.5352, 0.3145, -0.6992]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.4902, -0.4277, 0.3184, 0.6523, -0.4805]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.5938, -1.5000, -0.0330, -0.0664, -1.5078]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.4531, -2.4531, -1.2422, -0.8086, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:50:56,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.16 | optimizer_step: 0.19
[2025-11-06 17:50:56,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.17 | bwd_microstep: 1520.14 | bwd_inner_microstep: 1.43 | bwd_allreduce_microstep: 1518.61 | step_microstep: 1.70
[2025-11-06 17:50:56,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.65 | bwd: 1521.27 | bwd_inner: 2.48 | bwd_allreduce: 1518.65 | step: 1.79
5%|▍ | 174/3507 [06:10<1:15:02, 1.35s/it] {'loss': 0.9551, 'learning_rate': 1.9980278882828582e-05, 'epoch': 0.05}
tensor([[-2.3281, -2.2344, -0.5664, -0.5938, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.9180, -0.7969, 0.3750, 0.2285, -0.8555]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-0.5938, -0.4824, 0.5859, 0.5859, -0.5664]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.3281, -2.2812, -0.7773, -0.5117, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:50:56,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.92 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-0.6680, -0.5898, 0.3281, 0.4766, -0.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.7031, -2.6719, -1.2031, -0.8242, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.3828, -1.3203, -0.1582, 0.1270, -1.3359]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[0.0454, 0.1777, 1.1719, 1.0000, 0.0559]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:50:56,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:50:56,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.41 | bwd_microstep: 162.74 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 161.78 | step_microstep: 1.44
[2025-11-06 17:50:56,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 400.36 | bwd: 163.66 | bwd_inner: 1.71 | bwd_allreduce: 161.82 | step: 1.54
5%|▍ | 175/3507 [06:10<1:02:32, 1.13s/it] {'loss': 1.1313, 'learning_rate': 1.9979694782840536e-05, 'epoch': 0.05}
tensor([[-1.6094, -1.5312, -0.1099, -0.0255, -1.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:57,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.67 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.5938, -0.4844, 0.4746, 0.4707, -0.5547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.5000, -1.3906, 0.0762, 0.0081, -1.4141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.4531, -1.3672, -0.0884, -0.0099, -1.3828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.6484e+00, -1.5703e+00, -1.3672e-01, 1.5717e-03, -1.5703e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.1875, -2.0781, -0.5039, -0.5898, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-0.7891, -0.7266, 0.2422, 0.5352, -0.7773]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.2500, -1.1875, -0.0559, 0.2295, -1.2109]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:50:59,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.19 | optimizer_step: 0.17
[2025-11-06 17:50:59,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 81.99 | bwd_microstep: 2266.41 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 2265.40 | step_microstep: 1.87
[2025-11-06 17:50:59,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 208.67 | bwd: 2267.38 | bwd_inner: 1.79 | bwd_allreduce: 2265.45 | step: 1.95
5%|▌ | 176/3507 [06:13<1:25:28, 1.54s/it] {'loss': 1.2202, 'learning_rate': 1.9979102167478833e-05, 'epoch': 0.05}
tensor([[-1.8984, -1.7891, -0.2178, -0.2617, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.8047, -1.7578, -0.6641, -0.1924, -1.7266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.6602, -0.5586, 0.4766, 0.5195, -0.6289]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:50:59,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.15 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.5469, -1.4922, -0.3789, 0.0549, -1.4922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.7969, -1.7031, -0.1875, -0.1875, -1.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.6953, -1.6250, -0.3672, -0.1406, -1.6172]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.8594, -2.7812, -1.1250, -1.1016, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.2812, -1.2266, -0.2334, 0.0967, -1.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:50:59,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 17:50:59,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.64 | bwd_microstep: 66.10 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 64.95 | step_microstep: 1.49
[2025-11-06 17:50:59,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 384.82 | bwd: 67.00 | bwd_inner: 1.88 | bwd_allreduce: 64.99 | step: 1.57
5%|▌ | 177/3507 [06:13<1:07:55, 1.22s/it] {'loss': 0.9507, 'learning_rate': 1.9978501037249132e-05, 'epoch': 0.05}
tensor([[-1.5938, -1.5312, -0.2988, 0.0058, -1.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.9062, -2.8594, -1.2969, -0.9102, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:51:00,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.27 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.0469, -0.9766, 0.0479, 0.3301, -1.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.7383, -0.6367, 0.3496, 0.3145, -0.6914]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.5391, -1.4375, -0.0156, -0.0605, -1.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-0.9727, -0.8711, 0.3516, 0.5273, -0.9336]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[0.1123, 0.2314, 1.1562, 1.1484, 0.1040]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.7109, -1.6406, -0.3320, -0.0298, -1.6484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:51:02,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 17:51:02,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 299.51 | bwd_microstep: 2014.70 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 2013.65 | step_microstep: 1.95
[2025-11-06 17:51:02,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 486.81 | bwd: 2015.75 | bwd_inner: 1.93 | bwd_allreduce: 2013.69 | step: 2.04
5%|▌ | 178/3507 [06:16<1:29:52, 1.62s/it] {'loss': 0.9414, 'learning_rate': 1.997789139266436e-05, 'epoch': 0.05}
tensor([[-1.7891, -1.6953, -0.1680, 0.0270, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.9766, -1.9219, -0.5938, -0.1807, -1.8984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:51:02,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.74 | bwd_microstep: 1.22 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[3.1094, 3.2656, 3.2969, 3.1562, 2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.5156, -1.4141, -0.0659, 0.1348, -1.4453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.3848, -0.2617, 0.8398, 0.7852, -0.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.9023, -0.8320, 0.1543, 0.5234, -0.8789]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.5234e+00, -1.3906e+00, 1.1063e-03, -1.2500e-01, -1.4219e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.5938, -1.5078, -0.1040, 0.0299, -1.5234]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:51:02,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) |
optimizer_allgather: 0.18 | optimizer_gradients: 0.15 | optimizer_step: 0.20 [2025-11-06 17:51:02,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.46 | bwd_microstep: 30.99 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 29.68 | step_microstep: 1.48 [2025-11-06 17:51:02,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 468.23 | bwd: 32.20 | bwd_inner: 2.33 | bwd_allreduce: 29.73 | step: 1.57 5%|▌ | 179/3507 [06:16<1:11:53, 1.30s/it] {'loss': 0.9761, 'learning_rate': 1.9977273234244707e-05, 'epoch': 0.05} 5%|▌ | 179/3507 [06:16<1:11:53, 1.30s/it]tensor([[-1.6328, -1.5781, -0.4980, -0.1631, -1.5547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.0312, -1.9688, -0.5820, -0.2715, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:51:03,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.57 | bwd_microstep: 1.71 | bwd_inner_microstep: 1.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.7812, -2.7344, -1.0547, -0.6875, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.0000, -0.8633, 0.4727, 0.2598, -0.9258]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.0703, -0.9102, 0.4863, 0.3320, -0.9883]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.2812, -4.2500, -2.1719, -1.7344, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.5469, -1.4219, -0.0388, -0.1006, -1.4453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[0.4512, 0.5938, 1.4766, 1.4531, 0.4238]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:51:05,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.82 | optimizer_gradients: 
0.20 | optimizer_step: 0.29 [2025-11-06 17:51:05,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.98 | bwd_microstep: 1778.79 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 1777.64 | step_microstep: 2.90 [2025-11-06 17:51:05,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.58 | bwd: 1780.50 | bwd_inner: 2.69 | bwd_allreduce: 1777.69 | step: 2.98 5%|▌ | 180/3507 [06:18<1:26:22, 1.56s/it] {'loss': 1.2876, 'learning_rate': 1.9976646562517633e-05, 'epoch': 0.05} 5%|▌ | 180/3507 [06:18<1:26:22, 1.56s/it]tensor([[-1.2031, -1.1484, -0.1436, 0.1982, -1.1641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.2812, -2.2031, -0.6484, -0.4453, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.3750, -1.2656, 0.0591, 0.2520, -1.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:51:05,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.16 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-1.7031, -1.6016, -0.2139, -0.0879, -1.6172]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.3438, -1.2109, 0.1904, 0.0225, -1.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-1.4375, -1.3594, -0.1494, 0.2139, -1.3828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.1172, -1.0312, 0.1387, 0.3750, -1.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.2344, -2.1406, -0.5469, -0.4961, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:51:05,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 
17:51:05,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.38 | bwd_microstep: 46.11 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 45.05 | step_microstep: 1.56 [2025-11-06 17:51:05,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.57 | bwd: 47.21 | bwd_inner: 1.99 | bwd_allreduce: 45.09 | step: 1.65 5%|▌ | 181/3507 [06:19<1:08:00, 1.23s/it] {'loss': 1.1416, 'learning_rate': 1.997601137801785e-05, 'epoch': 0.05} 5%|▌ | 181/3507 [06:19<1:08:00, 1.23s/it]tensor([[-1.3203, -1.1875, 0.3027, 0.0903, -1.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.0547, -0.9648, 0.1807, 0.4395, -1.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1719, -3.0938, -1.3125, -1.0234, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:51:05,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.50 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[0.0334, 0.1533, 1.0781, 1.1875, 0.0170]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.5547, -1.4766, -0.1543, 0.1787, -1.4922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.7266, -1.6328, -0.3027, -0.1035, -1.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5938, -3.5469, -1.7734, -1.2344, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.6250, -1.5625, -0.4180, 0.0023, -1.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:51:07,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.25 [2025-11-06 17:51:07,763] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 303.88 | bwd_microstep: 1638.33 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1637.25 | step_microstep: 2.76 [2025-11-06 17:51:07,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 505.41 | bwd: 1639.30 | bwd_inner: 1.87 | bwd_allreduce: 1637.30 | step: 2.84 5%|▌ | 182/3507 [06:21<1:23:59, 1.52s/it] {'loss': 0.8979, 'learning_rate': 1.9975367681287358e-05, 'epoch': 0.05} 5%|▌ | 182/3507 [06:21<1:23:59, 1.52s/it]tensor([[-1.3516, -1.1953, 0.2578, 0.1074, -1.2578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.3672, -1.3125, -0.2559, 0.1206, -1.3203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[2.8125, 2.9844, 3.1875, 3.0156, 2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:51:07,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.98 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-0.4883, -0.3340, 0.8906, 0.7227, -0.4453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.8125, -1.7188, -0.3359, -0.1826, -1.7266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.5234, -1.4297, -0.0444, 0.1357, -1.4609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.9062, -1.8359, -0.5938, -0.1729, -1.8203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[2.1875, 2.3438, 2.5781, 2.5156, 2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:51:09,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 17:51:09,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.42 | 
bwd_microstep: 1263.83 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 1262.90 | step_microstep: 1.85 [2025-11-06 17:51:09,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.44 | bwd: 1264.65 | bwd_inner: 1.55 | bwd_allreduce: 1262.94 | step: 1.94 5%|▌ | 183/3507 [06:23<1:26:51, 1.57s/it] {'loss': 1.0674, 'learning_rate': 1.9974715472875382e-05, 'epoch': 0.05} 5%|▌ | 183/3507 [06:23<1:26:51, 1.57s/it]tensor([[-1.2812, -1.2031, -0.0396, 0.3887, -1.2422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.0391, -0.9336, 0.2256, 0.3516, -0.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:51:09,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.18 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.7578, -1.6875, -0.4512, -0.0591, -1.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.3418, 0.4824, 1.2266, 1.0000, 0.3457]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.7539, -0.6484, 0.4219, 0.5312, -0.7227]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.1094, -1.9688, -0.2021, -0.2227, -1.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.4688, -1.3984, -0.2021, 0.2295, -1.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.7969, -1.7344, -0.5469, -0.1543, -1.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:51:10,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 17:51:10,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 81.61 | bwd_microstep: 968.76 | 
bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 967.55 | step_microstep: 1.98 [2025-11-06 17:51:10,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 268.81 | bwd: 969.75 | bwd_inner: 2.02 | bwd_allreduce: 967.59 | step: 2.07 5%|▌ | 184/3507 [06:24<1:21:51, 1.48s/it] {'loss': 0.9087, 'learning_rate': 1.997405475333845e-05, 'epoch': 0.05} 5%|▌ | 184/3507 [06:24<1:21:51, 1.48s/it]tensor([[-5.9375, -5.8125, -3.1719, -2.9375, -5.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-1.7891, -1.6797, -0.2871, -0.0420, -1.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:51:10,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.64 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-1.4375, -1.3047, 0.1309, 0.1006, -1.3516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.0312, -0.8672, 0.5430, 0.3379, -0.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.1875, -1.0859, 0.0654, 0.2598, -1.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.1562, -2.0312, -0.4121, -0.3145, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.3691, -0.2500, 0.8516, 1.0234, -0.3652]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7656, -2.7031, -1.2969, -0.7227, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:51:11,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 17:51:11,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.17 | bwd_microstep: 560.04 | bwd_inner_microstep: 0.96 | 
bwd_allreduce_microstep: 559.00 | step_microstep: 1.71 [2025-11-06 17:51:11,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.84 | bwd: 561.00 | bwd_inner: 1.81 | bwd_allreduce: 559.04 | step: 1.80 5%|▌ | 185/3507 [06:25<1:13:26, 1.33s/it] {'loss': 1.2319, 'learning_rate': 1.9973385523240325e-05, 'epoch': 0.05} 5%|▌ | 185/3507 [06:25<1:13:26, 1.33s/it]tensor([[-3.3281, -3.2344, -1.3906, -1.0156, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.2969, -2.2031, -0.6328, -0.3750, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9844, -2.8906, -1.1172, -0.9453, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:51:11,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.42 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.6406, -2.5469, -1.0312, -0.7539, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.2500, -2.0938, -0.4102, -0.5195, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.9297, -1.8281, -0.3203, -0.0057, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8438, -1.7266, -0.2656, -0.0669, -1.7578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.9844, -0.8672, 0.3691, 0.4961, -0.9414]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:51:13,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 17:51:13,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.86 | bwd_microstep: 968.58 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 967.51 | 
step_microstep: 1.69 [2025-11-06 17:51:13,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.31 | bwd: 969.43 | bwd_inner: 1.77 | bwd_allreduce: 967.54 | step: 1.77 5%|▌ | 186/3507 [06:26<1:14:25, 1.34s/it] {'loss': 0.8501, 'learning_rate': 1.9972707783152042e-05, 'epoch': 0.05} 5%|▌ | 186/3507 [06:26<1:14:25, 1.34s/it]tensor([[-0.9922, -0.8984, 0.1260, 0.5664, -0.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.3906, -1.2891, -0.0181, 0.3066, -1.3359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7656, -1.6328, -0.0344, 0.0500, -1.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:51:13,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.47 | bwd_microstep: 1.20 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.4590, -0.2793, 0.9570, 0.7305, -0.4141]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-2.4375, -2.3438, -0.7578, -0.4121, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.0078, -0.8320, 0.5039, 0.2539, -0.9258]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.1719, -2.0625, -0.4199, -0.2695, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.7266, -1.5859, 0.0064, 0.0427, -1.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:51:15,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.79 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 17:51:15,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.80 | bwd_microstep: 1589.82 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1588.75 | step_microstep: 2.53 [2025-11-06 
17:51:15,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.30 | bwd: 1591.02 | bwd_inner: 2.11 | bwd_allreduce: 1588.79 | step: 2.60 5%|▌ | 187/3507 [06:28<1:25:37, 1.55s/it] {'loss': 1.0835, 'learning_rate': 1.99720215336519e-05, 'epoch': 0.05} 5%|▌ | 187/3507 [06:28<1:25:37, 1.55s/it]tensor([[-1.4609, -1.3594, -0.1089, 0.1416, -1.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.8047, -0.6172, 0.6367, 0.4297, -0.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-1.5938, -1.4922, -0.0938, 0.2715, -1.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.6836, -0.5664, 0.4316, 0.5586, -0.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.2578, -1.1328, 0.0123, 0.0452, -1.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:51:15,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 401.61 | bwd_microstep: 1.20 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 tensor([[-3.0469, -2.9531, -1.2500, -0.9258, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.1875, -2.0938, -0.6992, -0.2021, -2.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.0625, -3.9219, -1.6406, -1.5234, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:51:16,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 17:51:16,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.64 | bwd_microstep: 689.54 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 688.44 | step_microstep: 1.74 [2025-11-06 17:51:16,355] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 509.28 | bwd: 690.73 | bwd_inner: 2.05 | bwd_allreduce: 688.51 | step: 1.86 5%|▌ | 188/3507 [06:30<1:20:40, 1.46s/it] {'loss': 1.3188, 'learning_rate': 1.9971326775325453e-05, 'epoch': 0.05} 5%|▌ | 188/3507 [06:30<1:20:40, 1.46s/it]tensor([[-0.6016, -0.4746, 0.5898, 0.7422, -0.5742]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:51:16,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.57 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-1.8125, -1.6875, -0.1787, 0.0679, -1.7266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7188e+00, -1.5703e+00, 2.8992e-03, 3.0518e-05, -1.6250e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2500, -2.1719, -0.8750, -0.3652, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-1.4609, -1.3047, 0.2422, 0.2041, -1.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.1484, -1.0547, -0.0381, 0.4277, -1.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-1.3125, -1.2188, -0.0830, 0.3594, -1.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.6484, -1.5625, -0.3359, 0.1094, -1.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:51:18,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.16 | optimizer_step: 0.24 [2025-11-06 17:51:18,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.79 | bwd_microstep: 1490.77 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 1489.49 | step_microstep: 2.17 [2025-11-06 17:51:18,295] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.39 | bwd: 1491.65 | bwd_inner: 1.94 | bwd_allreduce: 1489.53 | step: 2.27 5%|▌ | 189/3507 [06:32<1:28:39, 1.60s/it] {'loss': 1.2847, 'learning_rate': 1.9970623508765516e-05, 'epoch': 0.05} 5%|▌ | 189/3507 [06:32<1:28:39, 1.60s/it]tensor([[0.1133, 0.3027, 1.3516, 1.2109, 0.1191]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.6016, -0.4844, 0.5391, 0.6211, -0.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0000, -3.9062, -1.9453, -1.4609, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:51:18,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.85 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-1.3750, -1.2734, -0.0811, 0.2080, -1.3203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.2422, -1.1172, 0.1475, 0.3320, -1.1797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.8672, -1.7500, -0.3535, 0.0532, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4062, -4.3125, -2.2656, -1.7422, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.0078, -0.8594, 0.3965, 0.3301, -0.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:51:18,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 17:51:18,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 283.11 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.78 | step_microstep: 2.12 [2025-11-06 17:51:18,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd: 460.98 | bwd: 2.55 | bwd_inner: 1.63 | bwd_allreduce: 0.81 | step: 2.19 5%|▌ | 190/3507 [06:32<1:10:25, 1.27s/it] {'loss': 0.8911, 'learning_rate': 1.9969911734572166e-05, 'epoch': 0.05} 5%|▌ | 190/3507 [06:32<1:10:25, 1.27s/it]tensor([[-1.8438, -1.7344, -0.4316, 0.0339, -1.7422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.7773, -0.6641, 0.4199, 0.7500, -0.7539]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.8398, -0.7344, 0.2578, 0.6875, -0.8086]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-1.0312, -0.8867, 0.4609, 0.5312, -0.9766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:51:19,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.81 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.7031, -2.6094, -1.1406, -0.5469, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.0156, -0.8438, 0.4297, 0.2598, -0.9336]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.6562, -1.5469, -0.3105, 0.1777, -1.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.0938, -1.9688, -0.4688, -0.1064, -1.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:51:20,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.01 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 17:51:20,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.83 | bwd_microstep: 965.75 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 964.51 | step_microstep: 2.64 [2025-11-06 17:51:20,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.67 | bwd: 966.63 | bwd_inner: 1.94 | 
bwd_allreduce: 964.55 | step: 2.74 5%|▌ | 191/3507 [06:34<1:12:31, 1.31s/it] {'loss': 1.0166, 'learning_rate': 1.996919145335274e-05, 'epoch': 0.05} 5%|▌ | 191/3507 [06:34<1:12:31, 1.31s/it]tensor([[-1.6406, -1.4531, 0.1816, 0.0942, -1.5234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:51:20,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 114.12 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.8125, -1.6562, -0.0302, 0.0282, -1.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2031, -3.0469, -1.1875, -1.0312, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.8828, -1.7422, -0.3223, -0.1650, -1.7734]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4062, -2.2500, -0.5703, -0.4590, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[2.2031, 2.3594, 2.6719, 2.6875, 2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0469, -1.9297, -0.4551, -0.0708, -1.9453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7734e+00, -1.6875e+00, -5.1172e-01, 1.0757e-03, -1.6953e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:51:20,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 17:51:20,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.83 | bwd_microstep: 76.51 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 75.21 | step_microstep: 1.41 [2025-11-06 17:51:20,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 298.97 | bwd: 77.47 | bwd_inner: 2.12 | bwd_allreduce: 75.24 | step: 
[Raw training console capture, steps 192–213 of 3507 (epoch ~0.05–0.06), 4 GPUs (cuda:0–cuda:3). Three output streams are interleaved per step:

  1. HF Trainer/tqdm progress lines carrying the per-step metrics, e.g.
     6%|▌ | 193/3507 [06:35<57:23, 1.04s/it] {'loss': 1.2598, 'learning_rate': 1.9967725372301287e-05, 'epoch': 0.06}
  2. Per-rank debug prints of a 1x5 bfloat16 logits tensor followed by an integer label tensor; the grad_fn repr is truncated in the capture ("grad_fn="), e.g.
     tensor([[-1.0078, -0.8594, 0.4453, 0.6367, -0.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
  3. DeepSpeed Rank 0 wall-clock timers (ms) for each microstep and step, e.g.
     [2025-11-06 17:51:21,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.52 | bwd: 2.96 | bwd_inner: 1.95 | bwd_allreduce: 0.88 | step: 2.18
     bwd_allreduce dominates on the slow steps (up to ~2.3 s, e.g. bwd_allreduce_microstep: 2312.20 at step 206), which accounts for the step-time swings between ~1.0 and ~1.6 s/it.

Per-step training metrics recovered from the progress lines (values verbatim):

  step  loss    learning_rate           epoch  step time
  192   0.9434  1.9968462665721828e-05  0.05   1.04s/it
  193   1.2598  1.9967725372301287e-05  0.06   1.04s/it
  194   0.9668  1.996697957372023e-05   0.06   1.35s/it
  195   1.1123  1.9966225270615016e-05  0.06   1.31s/it
  196   1.106   1.9965462463629274e-05  0.06   1.15s/it
  197   0.8794  1.9964691153413883e-05  0.06   1.29s/it
  198   0.8525  1.9963911340626982e-05  0.06   1.33s/it
  199   0.8574  1.996312302593396e-05   0.06   1.06s/it
  200   0.8584  1.9962326210007462e-05  0.06   1.43s/it
  201   0.8877  1.9961520893527385e-05  0.06   1.26s/it
  202   0.9624  1.9960707077180883e-05  0.06   1.43s/it
  203   0.9126  1.995988476166236e-05   0.06   1.14s/it
  204   0.9019  1.995905394767348e-05   0.06   1.31s/it
  205   0.8535  1.9958214635923144e-05  0.06   1.12s/it
  206   0.875   1.995736682712751e-05   0.06   1.61s/it
  207   0.854   1.9956510522009992e-05  0.06   1.28s/it
  208   0.8257  1.9955645721301252e-05  0.06   1.23s/it
  209   1.0684  1.9954772425739194e-05  0.06   1.32s/it
  210   0.8047  1.9953890636068975e-05  0.06   1.10s/it
  211   0.7715  1.9953000353043e-05     0.06   1.45s/it
  212   0.9097  1.9952101577420925e-05  0.06   1.22s/it
  213   0.8857  1.995119430996964e-05   0.06   1.12s/it

The capture is cut off mid tensor print after the step-213 metrics line.]
grad_fn=) tensor([2], device='cuda:3') tensor([[-0.5625, -0.2715, 1.2344, 0.8633, -0.4668]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.6641, -1.4922, 0.0282, 0.4043, -1.5547]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0703, -0.8398, 0.7070, 0.5039, -0.9609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3906, -2.2031, -0.3926, -0.0659, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:51:49,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.20 | optimizer_step: 0.21 [2025-11-06 17:51:49,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.61 | bwd_microstep: 1539.26 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 1537.93 | step_microstep: 2.09 [2025-11-06 17:51:49,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.20 | bwd: 1540.41 | bwd_inner: 2.26 | bwd_allreduce: 1537.98 | step: 2.20 6%|▌ | 214/3507 [07:03<1:15:39, 1.38s/it] {'loss': 0.7871, 'learning_rate': 1.9950278551463298e-05, 'epoch': 0.06} 6%|▌ | 214/3507 [07:03<1:15:39, 1.38s/it]tensor([[-3.3750, -3.1406, -1.0859, -0.8906, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-1.7031, -1.4766, 0.3359, 0.2891, -1.5703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:51:49,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.21 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12 tensor([[-1.9453, -1.7734, -0.1196, 0.3438, -1.8203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4219, -3.2031, -0.8867, -0.7500, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') 
tensor([[-2.0938e+00, -1.8281e+00, -6.7902e-04, -1.6211e-01, -1.9062e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.7891, -1.5547, 0.2793, 0.1719, -1.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.9922, -1.8359, -0.2793, 0.2578, -1.8672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.8906, -0.7500, 0.3965, 0.8438, -0.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:51:49,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:51:49,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.99 | bwd_microstep: 230.26 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 229.17 | step_microstep: 2.08 [2025-11-06 17:51:49,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.22 | bwd: 231.14 | bwd_inner: 1.71 | bwd_allreduce: 229.24 | step: 2.21 6%|▌ | 215/3507 [07:03<1:02:24, 1.14s/it] {'loss': 1.0703, 'learning_rate': 1.994935430268328e-05, 'epoch': 0.06} 6%|▌ | 215/3507 [07:03<1:02:24, 1.14s/it]tensor([[-1.2422, -1.0781, 0.3398, 0.8398, -1.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:51:50,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.36 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.3379, -0.1582, 1.1016, 1.3828, -0.3145]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.3594, -1.1953, 0.2207, 0.6875, -1.2734]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8438, -3.6094, -1.1641, -0.8984, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.6719, -1.4844, 
0.0503, 0.2812, -1.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.3594, -1.1953, 0.1992, 0.7695, -1.2578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.4375, 0.5938, 1.3984, 1.6562, 0.4258]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.6719, -1.4297, 0.4023, 0.2812, -1.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:51:50,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 17:51:50,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.44 | bwd_microstep: 1.95 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.06 [2025-11-06 17:51:50,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 435.83 | bwd: 2.85 | bwd_inner: 1.89 | bwd_allreduce: 0.83 | step: 2.13 6%|▌ | 216/3507 [07:04<51:34, 1.06it/s] {'loss': 0.8794, 'learning_rate': 1.9948421564418227e-05, 'epoch': 0.06} 6%|▌ | 216/3507 [07:04<51:34, 1.06it/s]tensor([[-2.1719, -1.8906, 0.0698, -0.0287, -1.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.5000, -1.3516, -0.0405, 0.4941, -1.3984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:51:50,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.82 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.0000, -1.7812, 0.2148, 0.2432, -1.8516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.9805, -0.8398, 0.3359, 0.7109, -0.9180]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.5781, -2.2969, -0.2051, -0.2471, -2.3750]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.0000, -1.8125, -0.0786, 0.2197, -1.8672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.1406, -1.8984, 0.1797, 0.3359, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1094, -2.8750, -0.7461, -0.6094, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:51:52,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 17:51:52,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.38 | bwd_microstep: 2180.44 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2179.36 | step_microstep: 1.66 [2025-11-06 17:51:52,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.22 | bwd: 2181.28 | bwd_inner: 1.76 | bwd_allreduce: 2179.40 | step: 1.74 6%|▌ | 217/3507 [07:06<1:18:36, 1.43s/it] {'loss': 0.875, 'learning_rate': 1.994748033746401e-05, 'epoch': 0.06} 6%|▌ | 217/3507 [07:06<1:18:36, 1.43s/it]tensor([[-0.9922, -0.8281, 0.5273, 0.9805, -0.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.4844, -1.2344, 0.5312, 0.5625, -1.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.7344, -1.4922, 0.4180, 0.3418, -1.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.2031, -0.9414, 0.8789, 0.7539, -1.0859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:51:53,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.33 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06 tensor([[-1.3203, -1.0469, 0.7109, 0.5508, -1.1953]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-1.8594, -1.6953, -0.1699, 0.3027, -1.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.0625, -1.8672, -0.1836, 0.1709, -1.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6094, -3.3906, -1.0938, -0.7852, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:51:53,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 17:51:53,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.55 | bwd_microstep: 74.65 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 73.71 | step_microstep: 1.76 [2025-11-06 17:51:53,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 425.90 | bwd: 75.40 | bwd_inner: 1.54 | bwd_allreduce: 73.74 | step: 1.82 6%|▌ | 218/3507 [07:07<1:03:56, 1.17s/it] {'loss': 0.8345, 'learning_rate': 1.9946530622623753e-05, 'epoch': 0.06} 6%|▌ | 218/3507 [07:07<1:03:56, 1.17s/it]tensor([[0.9766, 1.1875, 2.3125, 2.3594, 0.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.2451, 0.4336, 1.5000, 1.5781, 0.2490]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:51:53,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.79 | bwd_microstep: 1.13 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-1.5469, -1.2969, 0.5859, 0.5938, -1.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.0859, -0.8477, 0.8438, 0.7383, -0.9883]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.0938, -1.9062, -0.1768, 0.1973, -1.9609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.0469, -1.7891, 0.1553, 
-0.0304, -1.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0469, -1.8828, -0.2734, 0.3359, -1.9141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0469, -1.7656, 0.2598, 0.0146, -1.8672]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:51:55,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.33 | optimizer_step: 0.43 [2025-11-06 17:51:55,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.11 | bwd_microstep: 1985.29 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1984.07 | step_microstep: 3.34 [2025-11-06 17:51:55,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.93 | bwd: 1986.43 | bwd_inner: 2.12 | bwd_allreduce: 1984.14 | step: 3.44 6%|▌ | 219/3507 [07:09<1:23:45, 1.53s/it] {'loss': 0.8315, 'learning_rate': 1.9945572420707825e-05, 'epoch': 0.06} 6%|▌ | 219/3507 [07:09<1:23:45, 1.53s/it]tensor([[-2.1406, -1.8516, 0.1777, -0.0762, -1.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.6094, -2.4219, -0.5664, -0.0435, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:51:56,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.43 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-1.6641, -1.5156, -0.0491, 0.4180, -1.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.6016, -1.3359, 0.4844, 0.4961, -1.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-1.3281, -1.1797, 0.1035, 0.5430, -1.2422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.1094, -1.9609, -0.3730, 0.2432, -1.9844]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5625, -2.3594, -0.4492, -0.1797, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.3203, -1.0938, 0.5625, 0.5664, -1.2109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:51:56,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 17:51:56,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.77 | bwd_microstep: 6.06 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 5.03 | step_microstep: 1.66 [2025-11-06 17:51:56,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.23 | bwd: 6.90 | bwd_inner: 1.72 | bwd_allreduce: 5.06 | step: 1.75 6%|▋ | 220/3507 [07:10<1:05:00, 1.19s/it] {'loss': 0.979, 'learning_rate': 1.994460573253382e-05, 'epoch': 0.06} 6%|▋ | 220/3507 [07:10<1:05:00, 1.19s/it]tensor([[-1.0469, -0.8555, 0.7227, 0.9609, -0.9766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:51:56,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.59 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-1.2188, -0.9492, 0.7461, 0.4668, -1.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.5156, -1.3203, 0.3242, 0.6562, -1.4141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0156, -1.7734, 0.2012, 0.2734, -1.8672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.5391, -1.2500, 0.6445, 0.5352, -1.3984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.7344, -1.5234, 0.1992, 0.4668, -1.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:1') tensor([[-1.9453, -1.7422, 0.0481, 0.3184, -1.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.2559, 0.0116, 1.5703, 1.3125, -0.1953]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:51:57,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:51:57,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.95 | bwd_microstep: 1007.85 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1006.79 | step_microstep: 1.84 [2025-11-06 17:51:57,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.57 | bwd: 1008.84 | bwd_inner: 1.90 | bwd_allreduce: 1006.82 | step: 1.91 6%|▋ | 221/3507 [07:11<1:07:49, 1.24s/it] {'loss': 0.8359, 'learning_rate': 1.9943630558926588e-05, 'epoch': 0.06} 6%|▋ | 221/3507 [07:11<1:07:49, 1.24s/it]tensor([[-2.2812, -2.1250, -0.4043, 0.2617, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:51:57,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.56 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.7578, -1.4844, 0.4688, 0.2334, -1.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.9219, -0.7500, 0.6172, 1.0703, -0.8633]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.1484, -0.9258, 0.8047, 0.8320, -1.0547]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2344, -2.0625, -0.4238, 0.1846, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.5547, -1.3984, 0.0269, 0.5469, -1.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.9609, -1.7266, 
0.1138, 0.1934, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.0469, -1.8047, 0.0566, 0.0957, -1.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:51:58,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 17:51:58,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.31 | bwd_microstep: 629.46 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 628.33 | step_microstep: 1.64 [2025-11-06 17:51:58,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 291.89 | bwd: 630.46 | bwd_inner: 1.98 | bwd_allreduce: 628.36 | step: 1.72 6%|▋ | 222/3507 [07:12<1:03:06, 1.15s/it] {'loss': 0.7651, 'learning_rate': 1.9942646900718218e-05, 'epoch': 0.06} 6%|▋ | 222/3507 [07:12<1:03:06, 1.15s/it]tensor([[-0.3418, -0.0977, 1.3359, 1.2344, -0.2891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.0469, -1.8906, -0.3652, 0.2832, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.2969, -1.1016, 0.5078, 0.9766, -1.2109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7188, -2.4375, -0.1875, -0.2598, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:51:58,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.65 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.2188, -1.0078, 0.5586, 0.7031, -1.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.3750, -2.2188, -0.6055, 0.0718, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-2.7656, -2.5781, -0.6211, -0.0903, -2.5781]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.8906, -2.6250, -0.4512, -0.3516, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 17:52:00,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.09 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 17:52:00,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.26 | bwd_microstep: 1620.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 1620.02 | step_microstep: 3.01 [2025-11-06 17:52:00,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 429.94 | bwd: 1621.74 | bwd_inner: 1.53 | bwd_allreduce: 1620.06 | step: 3.09 6%|▋ | 223/3507 [07:14<1:18:32, 1.44s/it] {'loss': 1.3164, 'learning_rate': 1.994165475874803e-05, 'epoch': 0.06} 6%|▋ | 223/3507 [07:14<1:18:32, 1.44s/it]tensor([[-0.8203, -0.5469, 1.1328, 0.9453, -0.7305]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:52:00,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 76.92 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.2031, -1.9844, -0.0176, 0.2422, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.0000, -0.6992, 1.1016, 0.7109, -0.8789]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.9141, -1.7344, -0.1885, 0.2949, -1.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-2.1719, -1.9375, 0.1055, 0.2656, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.2031, -1.9453, 0.1631, 0.0664, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[2.2500, 2.4844, 3.2188, 2.9688, 2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-2.7344, -2.5469, -0.7773, -0.2891, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:52:06,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.21 | optimizer_step: 0.30 [2025-11-06 17:52:06,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.58 | bwd_microstep: 5798.42 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 5797.54 | step_microstep: 2.31 [2025-11-06 17:52:06,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.52 | bwd: 5799.11 | bwd_inner: 1.37 | bwd_allreduce: 5797.60 | step: 2.39 6%|▋ | 224/3507 [07:20<2:37:19, 2.88s/it] {'loss': 1.0464, 'learning_rate': 1.99406541338626e-05, 'epoch': 0.06} 6%|▋ | 224/3507 [07:20<2:37:19, 2.88s/it]tensor([[-2.2812, -2.0938, -0.3418, 0.0708, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:52:07,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.18 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.2031, -1.9609, -0.0164, 0.2451, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.4688, -1.2969, 0.2188, 0.7305, -1.3672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.3438, -1.1406, 0.4629, 0.8711, -1.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8906, -2.6875, -0.7031, -0.1406, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.9062, -1.7188, 0.0052, 0.5117, -1.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.2969, -1.0391, 0.7969, 0.6250, -1.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.9727, -0.7109, 
0.9961, 0.6797, -0.8633]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:52:07,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 17:52:07,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.46 | bwd_microstep: 161.97 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 160.87 | step_microstep: 1.56 [2025-11-06 17:52:07,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.67 | bwd: 162.95 | bwd_inner: 1.93 | bwd_allreduce: 160.89 | step: 1.63 6%|▋ | 225/3507 [07:21<1:58:44, 2.17s/it] {'loss': 0.7241, 'learning_rate': 1.993964502691572e-05, 'epoch': 0.06} 6%|▋ | 225/3507 [07:21<1:58:44, 2.17s/it]tensor([[-1.6953, -1.4688, 0.2129, 0.4707, -1.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4531, -3.2656, -1.2734, -0.6992, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.1387, 0.3223, 1.4844, 1.8281, 0.1396]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:52:07,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.68 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-1.7656, -1.5703, 0.1128, 0.6094, -1.6484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.1406, -1.9453, -0.1475, 0.2520, -1.9922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.0938, -1.8984, -0.0310, 0.4258, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.3438, -1.1094, 0.5859, 0.8203, -1.2422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5000, -2.2812, -0.2539, -0.0796, -2.3281]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:52:07,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 17:52:07,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.32 | bwd_microstep: 77.81 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 76.64 | step_microstep: 1.42 [2025-11-06 17:52:07,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.02 | bwd: 78.73 | bwd_inner: 1.92 | bwd_allreduce: 76.68 | step: 1.50 6%|▋ | 226/3507 [07:21<1:29:52, 1.64s/it] {'loss': 0.7749, 'learning_rate': 1.9938627438768433e-05, 'epoch': 0.06} 6%|▋ | 226/3507 [07:21<1:29:52, 1.64s/it]tensor([[-1.9062, -1.6016, 0.3359, 0.1055, -1.7266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-2.0781, -1.8203, 0.2695, 0.2383, -1.9141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.9258, -0.7656, 0.4980, 0.9961, -0.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:52:07,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.21 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.8906, -1.7031, -0.0160, 0.3320, -1.7578]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3281, -2.0938, 0.0111, 0.2852, -2.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9609, -1.7891, -0.1523, 0.3594, -1.8359]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0000, -0.7969, 0.7422, 1.1094, -0.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -4.7188, -1.9062, -1.4531, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], 
Training log excerpt: steps 227-248 of 3507 (epoch 0.06-0.07), 2025-11-06 17:52:08 to 17:52:37. The raw console output interleaved three streams per step, with each tqdm progress line printed twice: (1) tqdm progress/ETA lines, (2) DeepSpeed [Rank 0] wall-clock timer breakdowns (fwd/bwd/step and their microsteps, in ms), and (3) per-rank debug prints of a 1x5 bfloat16 logits tensor (truncated grad_fn= repr) followed by an integer label tensor on cuda:0 through cuda:3. Deduplicated per-step metrics:

step   loss    learning_rate    sec/it
 227   1.0020  1.993760e-05      1.27
 228   0.7183  1.993657e-05      1.53
 229   0.8188  1.993552e-05      1.39
 230   0.7827  1.993447e-05      1.50
 231   1.0190  1.993341e-05      1.20
 232   0.8047  1.993234e-05      1.82
 233   0.8394  1.993127e-05      1.42
 234   1.0630  1.993018e-05      1.73
 235   0.7676  1.992909e-05      1.38
 236   0.9609  1.992799e-05      1.58
 237   0.6841  1.992687e-05      1.26
 238   0.7754  1.992576e-05      1.06
 239   0.7954  1.992463e-05      1.02
 240   0.7402  1.992349e-05      1.52
 241   0.8513  1.992235e-05      1.23
 242   0.7959  1.992119e-05      1.33
 243   0.7319  1.992003e-05      1.20
 244   1.1001  1.991886e-05      1.54
 245   0.7656  1.991768e-05      1.22
 246   0.9727  1.991650e-05      1.40
 247   0.8892  1.991530e-05      1.20
 248   1.0107  1.991410e-05      1.49

Per the DeepSpeed timers, fwd time was fairly stable (roughly 270-460 ms per step), while nearly all of the step-to-step wall-time variance came from bwd_allreduce: from ~0.03 ms on microsteps without gradient sync up to ~2818 ms on the slowest step (step 232). Optimizer allgather/gradients/step each stayed well under ~1.2 ms. The excerpt ends mid-record at 17:52:37, during the logits debug print for step 248.
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.92 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.5000, -3.2500, -1.1250, -0.2344, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8438, -2.4844, 0.1270, 0.0166, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7656, -2.4688, -0.2812, -0.1582, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.5625, -2.3438, -0.4922, 0.2158, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.3906, -2.1562, -0.2598, 0.4590, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.4395, 0.7266, 2.4219, 2.0469, 0.4590]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.8438, -1.5391, 0.4355, 0.5156, -1.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:52:39,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:52:39,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.17 | bwd_microstep: 901.74 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 900.40 | step_microstep: 1.55 [2025-11-06 17:52:39,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.11 | bwd: 902.68 | bwd_inner: 2.12 | bwd_allreduce: 900.43 | step: 1.63 7%|▋ | 249/3507 [07:52<1:17:47, 1.43s/it] {'loss': 0.7273, 'learning_rate': 1.991288418860184e-05, 'epoch': 0.07} 7%|▋ | 249/3507 [07:52<1:17:47, 1.43s/it]tensor([[-2.6719, -2.4219, -0.4688, -0.0184, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:52:39,237] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 155.45 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.0938, -1.7266, 0.4609, 0.0889, -1.8984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.5312, -3.2656, -1.1953, -0.4766, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-2.0000, -1.6328, 0.8047, 0.4902, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.8906, -2.5000, -0.0214, -0.2188, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2500, -1.9141, 0.4883, 0.3555, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.6094, -2.3906, -0.5586, 0.1914, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9375, -1.5703, 0.7422, 0.5898, -1.7578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:52:39,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 17:52:39,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.05 | bwd_microstep: 110.58 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 109.42 | step_microstep: 1.80 [2025-11-06 17:52:39,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.52 | bwd: 111.53 | bwd_inner: 1.95 | bwd_allreduce: 109.45 | step: 1.87 7%|▋ | 250/3507 [07:53<1:02:17, 1.15s/it] {'loss': 1.2725, 'learning_rate': 1.9911663328761097e-05, 'epoch': 0.07} 7%|▋ | 250/3507 [07:53<1:02:17, 1.15s/it]tensor([[-1.6797, -1.3203, 0.9219, 0.4688, -1.5078]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.4375, -1.1250, 0.9141, 0.9648, -1.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') tensor([[-2.3125, -2.0000, 0.2432, 0.5117, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1094, -2.8125, -0.5469, -0.2988, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:52:40,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.99 | bwd_microstep: 1.26 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.6562, -4.3750, -1.8438, -0.9492, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.7188, -2.4219, -0.1494, 0.2637, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7812, -2.4844, -0.0889, 0.1875, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.5156, -1.1250, 1.0391, 0.6445, -1.3516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:52:41,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 17:52:41,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.09 | bwd_microstep: 689.50 | bwd_inner_microstep: 1.58 | bwd_allreduce_microstep: 687.82 | step_microstep: 1.67 [2025-11-06 17:52:41,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 394.11 | bwd: 690.76 | bwd_inner: 2.75 | bwd_allreduce: 687.87 | step: 1.76 7%|▋ | 251/3507 [07:54<1:08:11, 1.26s/it] {'loss': 0.6853, 'learning_rate': 1.9910434011595893e-05, 'epoch': 0.07} 7%|▋ | 251/3507 [07:54<1:08:11, 1.26s/it]tensor([[-3.3594, -3.0000, -0.3125, -0.2031, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.2812, -1.0156, 0.8594, 1.1953, -1.1797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0938, -3.8125, 
-1.3438, -0.5391, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:52:41,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.48 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.6328, -1.2578, 0.7656, 0.4434, -1.4609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.3906, -3.0781, -0.6875, -0.3320, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.0625, -1.7344, 0.5781, 0.6484, -1.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -4.4375, -1.5469, -1.0625, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.8750, -2.5156, 0.1270, 0.2158, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:52:41,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 17:52:41,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 284.36 | bwd_microstep: 104.53 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 103.12 | step_microstep: 1.93 [2025-11-06 17:52:41,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 485.84 | bwd: 105.49 | bwd_inner: 2.20 | bwd_allreduce: 103.16 | step: 2.01 7%|▋ | 252/3507 [07:55<58:05, 1.07s/it] {'loss': 0.9622, 'learning_rate': 1.9909196238155166e-05, 'epoch': 0.07} 7%|▋ | 252/3507 [07:55<58:05, 1.07s/it]tensor([[-2.5156, -2.2812, -0.3945, 0.2490, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7500, -2.3438, 0.2617, -0.0391, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7188, -2.4375, -0.2100, 0.3809, -2.5312]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.1562, -0.7891, 1.4297, 0.9805, -1.0234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:52:42,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 296.14 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10 tensor([[-1.8750, -1.6641, -0.0535, 0.5938, -1.7422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.6250, -2.2812, 0.0258, 0.0405, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.0000, -0.7148, 1.1328, 1.2422, -0.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5000, -3.1875, -0.5664, -0.1826, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:52:44,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:52:44,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.96 | bwd_microstep: 1379.78 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1378.70 | step_microstep: 1.61 [2025-11-06 17:52:44,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 449.13 | bwd: 1380.78 | bwd_inner: 1.91 | bwd_allreduce: 1378.75 | step: 1.71 7%|▋ | 253/3507 [07:57<1:19:47, 1.47s/it] {'loss': 0.666, 'learning_rate': 1.990795000949507e-05, 'epoch': 0.07} 7%|▋ | 253/3507 [07:57<1:19:47, 1.47s/it]tensor([[-2.0156, -1.6562, 0.5977, 0.4902, -1.8359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.2500, -1.9609, 0.1660, 0.7344, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.4609, -1.2266, 0.4688, 0.8359, -1.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:0') [2025-11-06 17:52:44,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.72 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-0.5430, -0.2539, 1.6016, 1.6797, -0.4805]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2188, -1.9922, -0.2617, 0.4473, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.7188, -1.3438, 0.9883, 0.5586, -1.5391]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.4297, -1.2031, 0.4688, 1.0312, -1.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7656, -2.4062, -0.0393, 0.1196, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:52:44,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:52:44,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.02 | bwd_microstep: 34.65 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 33.46 | step_microstep: 1.38 [2025-11-06 17:52:44,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.77 | bwd: 35.62 | bwd_inner: 2.00 | bwd_allreduce: 33.49 | step: 1.46 7%|▋ | 254/3507 [07:58<1:02:43, 1.16s/it] {'loss': 0.7554, 'learning_rate': 1.9906695326678975e-05, 'epoch': 0.07} 7%|▋ | 254/3507 [07:58<1:02:43, 1.16s/it]tensor([[-2.8750, -2.6094, -0.5078, 0.1621, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.5625, -1.2812, 0.6055, 0.9844, -1.4453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0469, -2.7031, -0.2793, 0.0549, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.2812, -2.0000, 0.0830, 
0.5977, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9062, -2.5312, -0.0981, -0.1582, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.0781, -1.7266, 0.5898, 0.6875, -1.8984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:52:46,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.54 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.5156, -1.1875, 0.8359, 0.8945, -1.3828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.3125, -2.0781, -0.2578, 0.4453, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:52:46,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 17:52:46,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.23 | bwd_microstep: 1.78 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.68 | step_microstep: 1.90 [2025-11-06 17:52:46,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.79 | bwd: 2.84 | bwd_inner: 1.99 | bwd_allreduce: 0.72 | step: 1.99 7%|▋ | 255/3507 [08:00<1:17:31, 1.43s/it] {'loss': 0.6816, 'learning_rate': 1.990543219077746e-05, 'epoch': 0.07} 7%|▋ | 255/3507 [08:00<1:17:31, 1.43s/it]tensor([[-2.7031, -2.3750, 0.0248, 0.1895, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8047, -1.5078, 0.6875, 0.9727, -1.6641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:52:46,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.88 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.0781, 
-2.7188, -0.0410, -0.0105, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3125, -5.0312, -2.4062, -1.4766, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.8359, -1.5625, 0.3164, 0.5156, -1.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.0078, -0.7227, 0.9023, 0.8750, -0.9023]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.4844, -1.1406, 1.0625, 0.9922, -1.3516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2969, -2.9844, -0.6250, -0.2637, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:52:47,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.22 | optimizer_step: 0.22 [2025-11-06 17:52:47,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.31 | bwd_microstep: 138.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 137.02 | step_microstep: 1.95 [2025-11-06 17:52:47,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 276.21 | bwd: 138.91 | bwd_inner: 1.72 | bwd_allreduce: 137.06 | step: 2.02 7%|▋ | 256/3507 [08:00<1:01:33, 1.14s/it] {'loss': 0.7478, 'learning_rate': 1.990416060286833e-05, 'epoch': 0.07} 7%|▋ | 256/3507 [08:00<1:01:33, 1.14s/it]tensor([[-3.0000, -2.6094, 0.2158, 0.2676, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.4297, -1.0547, 1.0156, 0.5273, -1.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.2969, -1.8984, 0.6836, 0.2734, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0781, -2.7812, -0.4941, -0.0518, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-1.6172, -1.3203, 0.6094, 0.7578, -1.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.1172, -0.7617, 1.3828, 0.8711, -0.9805]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:52:49,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.38 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-1.3906, -1.1641, 0.4902, 1.0234, -1.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.0469, -0.8281, 0.7969, 1.3594, -0.9727]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:52:49,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 17:52:49,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.45 | bwd_microstep: 21.02 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 19.82 | step_microstep: 2.37 [2025-11-06 17:52:49,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.86 | bwd: 21.99 | bwd_inner: 2.00 | bwd_allreduce: 19.86 | step: 2.46 7%|▋ | 257/3507 [08:03<1:27:47, 1.62s/it] {'loss': 0.6675, 'learning_rate': 1.9902880564036587e-05, 'epoch': 0.07} 7%|▋ | 257/3507 [08:03<1:27:47, 1.62s/it]tensor([[-1.7578, -1.3203, 1.0859, 0.5234, -1.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-2.1562, -1.9297, -0.1523, 0.5664, -2.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0703, -0.6719, 1.4531, 0.8984, -0.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-2.1719, -1.9062, 0.0089, 0.5859, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:52:50,016] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.31 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.3594, -2.0938, -0.1387, 0.4277, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2031, -1.8359, 0.5195, 0.5234, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.4219, -3.0469, -0.2617, 0.0095, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4531, -3.0781, -0.4121, -0.3320, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:52:50,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.15 | optimizer_step: 0.25 [2025-11-06 17:52:50,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.71 | bwd_microstep: 128.37 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 127.11 | step_microstep: 1.90 [2025-11-06 17:52:50,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.05 | bwd: 129.21 | bwd_inner: 1.94 | bwd_allreduce: 127.14 | step: 1.97 7%|▋ | 258/3507 [08:04<1:10:13, 1.30s/it] {'loss': 1.2104, 'learning_rate': 1.9901592075374447e-05, 'epoch': 0.07} 7%|▋ | 258/3507 [08:04<1:10:13, 1.30s/it]tensor([[-1.8125, -1.4922, 0.6211, 0.8633, -1.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.3359, -0.9453, 1.1172, 0.6641, -1.1797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.8867, -0.6953, 0.6641, 1.1094, -0.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2188, -1.9062, 0.2305, 0.6211, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5000, -2.1719, 0.1270, 0.3887, -2.3125]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.1914, 0.1445, 2.0781, 1.6484, -0.1270]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3750, -3.0938, -0.9805, -0.1670, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:52:52,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 501.69 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.19 tensor([[-2.1719, -1.7812, 0.7617, 0.5195, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:52:53,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 17:52:53,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.48 | bwd_microstep: 1.96 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.15 [2025-11-06 17:52:53,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 782.16 | bwd: 2.99 | bwd_inner: 1.97 | bwd_allreduce: 0.90 | step: 2.34 7%|▋ | 259/3507 [08:07<1:36:13, 1.78s/it] {'loss': 0.7375, 'learning_rate': 1.9900295137981345e-05, 'epoch': 0.07} 7%|▋ | 259/3507 [08:07<1:36:13, 1.78s/it]tensor([[-3.5000, -3.1719, -0.9570, -0.7383, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.5938, -5.1875, -2.1250, -1.2578, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1250, -1.7656, 0.6367, 0.6680, -1.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:52:53,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.25 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.4219, -2.0781, 0.2930, 0.4531, 
-2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.4844, -1.1484, 1.0703, 1.0625, -1.3516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6562, -2.2969, 0.1221, 0.2891, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8125, -2.4531, 0.0996, 0.2617, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[1.5625, 1.8047, 3.0781, 2.7656, 1.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:52:53,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.83 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 17:52:53,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.48 | bwd_microstep: 24.16 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 23.04 | step_microstep: 2.66 [2025-11-06 17:52:53,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 463.75 | bwd: 25.05 | bwd_inner: 1.85 | bwd_allreduce: 23.08 | step: 2.74 7%|▋ | 260/3507 [08:07<1:15:58, 1.40s/it] {'loss': 1.041, 'learning_rate': 1.9898989752963915e-05, 'epoch': 0.07} 7%|▋ | 260/3507 [08:07<1:15:58, 1.40s/it]tensor([[-2.5938, -2.2812, 0.0060, 0.5547, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.5156, -1.1406, 0.8984, 0.4258, -1.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:52:54,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.42 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.5000, -1.2266, 0.7305, 1.1562, -1.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.4375, -2.1875, -0.1973, 0.5195, -2.2812]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -3.6719, -0.9023, -0.4902, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.1875, -1.9375, -0.1299, 0.2500, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7734, -1.4453, 0.8867, 0.9883, -1.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3125, -4.8750, -1.6719, -1.2656, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:52:54,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 17:52:54,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.56 | bwd_microstep: 644.18 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 643.08 | step_microstep: 2.13 [2025-11-06 17:52:54,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 283.98 | bwd: 645.16 | bwd_inner: 1.91 | bwd_allreduce: 643.13 | step: 2.22 7%|▋ | 261/3507 [08:08<1:10:33, 1.30s/it] {'loss': 0.6289, 'learning_rate': 1.9897675921436002e-05, 'epoch': 0.07} 7%|▋ | 261/3507 [08:08<1:10:33, 1.30s/it]tensor([[-0.1191, 0.2139, 2.0312, 1.6250, -0.0581]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.7188, -1.4609, 0.3613, 0.8359, -1.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7500, -3.3438, -0.5781, -0.4707, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:52:55,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.29 | bwd_microstep: 1.12 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.2969, -2.9844, -0.6719, -0.1250, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-2.0156, -1.6562, 0.7188, 0.7461, -1.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.3594, -2.9375, -0.1494, -0.2441, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.6016, -1.3906, 0.0811, 0.5469, -1.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.5625, -2.2656, -0.1953, 0.3965, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:52:55,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 17:52:55,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 76.79 | bwd_microstep: 128.87 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 127.70 | step_microstep: 2.75
[2025-11-06 17:52:55,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 278.10 | bwd: 129.99 | bwd_inner: 2.13 | bwd_allreduce: 127.74 | step: 2.84
7%|▋ | 262/3507 [08:09<56:33, 1.05s/it] {'loss': 0.6909, 'learning_rate': 1.989635364451866e-05, 'epoch': 0.07}
tensor([[-1.9141, -1.5547, 0.7969, 0.7617, -1.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.5156, -2.1719, 0.0737, 0.2285, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0625, -2.7031, -0.3301, -0.2002, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.3281, -1.9375, 0.4453, 0.1895, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.3438, -1.9297, 0.5742, 0.2256, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:52:56,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.36 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.2969, -2.8438, -0.1787, -0.5547, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.9688, -2.6250, -0.2402, 0.0991, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.1875, -2.8594, -0.5117, 0.0439, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:52:56,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:52:56,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.12 | bwd_microstep: 205.88 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 204.88 | step_microstep: 2.04
[2025-11-06 17:52:56,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.48 | bwd: 206.85 | bwd_inner: 1.78 | bwd_allreduce: 204.93 | step: 2.13
7%|▋ | 263/3507 [08:10<1:04:55, 1.20s/it] {'loss': 0.7378, 'learning_rate': 1.9895022923340152e-05, 'epoch': 0.07}
tensor([[-1.6250, -1.2188, 0.9727, 0.4531, -1.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-2.1562, -1.8359, 0.3105, 0.5000, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:52:57,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.75 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-1.7891, -1.3828, 1.1641, 0.8867, -1.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-0.9414, -0.5391, 1.7031, 1.1875, -0.8164]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.1562, -2.8281, -0.4824, -0.0615, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.0312, -1.6484, 0.8242, 0.5352, -1.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2188, -4.8750, -2.1562, -1.1484, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.8672, -1.4766, 0.8789, 0.7734, -1.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:52:58,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.28 | optimizer_step: 0.28
[2025-11-06 17:52:58,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.85 | bwd_microstep: 1262.86 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1261.62 | step_microstep: 3.23
[2025-11-06 17:52:58,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 293.61 | bwd: 1263.71 | bwd_inner: 1.84 | bwd_allreduce: 1261.70 | step: 3.35
8%|▊ | 264/3507 [08:12<1:11:21, 1.32s/it] {'loss': 0.9016, 'learning_rate': 1.9893683759035937e-05, 'epoch': 0.08}
tensor([[-3.5938, -3.2188, -0.6602, -0.3887, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.7422, -1.2969, 1.0156, 0.4863, -1.5547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.3125, -1.9844, 0.1709, 0.4160, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.5469, -1.1953, 1.0625, 1.1797, -1.4141]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.7969, -1.4297, 0.9180, 0.9688, -1.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:52:58,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.21 | bwd_microstep: 1.81 | bwd_inner_microstep: 1.56 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.20
tensor([[-2.8438, -2.5625, -0.5273, 0.1494, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.5000, -1.1875, 0.7969, 1.0312, -1.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.2031, -1.7656, 0.6523, 0.1426, -1.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:52:59,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.33 | optimizer_step: 0.31
[2025-11-06 17:52:59,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.01 | bwd_microstep: 441.38 | bwd_inner_microstep: 1.82 | bwd_allreduce_microstep: 439.40 | step_microstep: 3.31
[2025-11-06 17:52:59,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.18 | bwd: 443.20 | bwd_inner: 3.41 | bwd_allreduce: 439.51 | step: 3.54
8%|▊ | 265/3507 [08:13<1:07:53, 1.26s/it] {'loss': 0.6665, 'learning_rate': 1.989233615274868e-05, 'epoch': 0.08}
tensor([[-2.5938, -2.1406, 0.4336, 0.0776, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[0.3398, 0.7305, 2.6406, 1.8203, 0.3965]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:52:59,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 134.93 | bwd_microstep: 1.81 | bwd_inner_microstep: 1.54 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.18
tensor([[-2.1562, -1.8672, 0.1953, 0.6758, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.2188, -1.7891, 0.7578, 0.2695, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.1875, -1.7422, 0.7812, 0.3164, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.8438, -2.4844, 0.1079, 0.3535, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.1875, -1.7656, 0.7383, 0.3203, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.6172, -1.1953, 0.9570, 0.6172, -1.4453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
[2025-11-06 17:53:00,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 17:53:00,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.96 | bwd_microstep: 633.81 | bwd_inner_microstep: 1.89 | bwd_allreduce_microstep: 631.72 | step_microstep: 1.74
[2025-11-06 17:53:00,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.91 | bwd: 635.59 | bwd_inner: 3.50 | bwd_allreduce: 631.80 | step: 1.92
8%|▊ | 266/3507 [08:14<1:03:44, 1.18s/it] {'loss': 0.9229, 'learning_rate': 1.9890980105628266e-05, 'epoch': 0.08}
tensor([[-3.6875, -3.3750, -0.9883, -0.1108, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:00,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.27 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-2.9062, -2.5781, -0.3730, 0.2949, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.5938, -5.1562, -2.0781, -1.4219, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.5000, -2.1094, 0.3359, 0.4688, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.5078, -1.2109, 0.7070, 1.0547, -1.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.8672, -1.4297, 0.8945, 0.5586, -1.6797]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.2031, -1.8828, 0.3047, 0.7773, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.0000, -1.7266, 0.1729, 0.6367, -1.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:53:02,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.36 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 17:53:02,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.09 | bwd_microstep: 1635.30 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 1634.08 | step_microstep: 3.64
[2025-11-06 17:53:02,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.38 | bwd: 1636.31 | bwd_inner: 2.02 | bwd_allreduce: 1634.14 | step: 3.76
8%|▊ | 267/3507 [08:16<1:17:09, 1.43s/it] {'loss': 0.6594, 'learning_rate': 1.988961561883176e-05, 'epoch': 0.08}
tensor([[-1.8359, -1.5703, 0.2500, 0.8945, -1.6953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 17:53:02,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.24 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.4531, -2.1719, -0.1348, 0.6328, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.4531, -2.1250, 0.2051, 0.5352, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.4844, -2.0000, 0.6914, 0.0840, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.7500, -1.4219, 0.7031, 0.8594, -1.6016]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.0938, -2.7656, -0.5312, -0.0491, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.2031, -1.8828, 0.2617, 0.7539, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[1.4688, 1.7266, 3.2031, 3.0625, 1.3984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:53:03,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.40 | optimizer_step: 0.48
[2025-11-06 17:53:03,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.47 | bwd_microstep: 365.07 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 364.08 | step_microstep: 3.25
[2025-11-06 17:53:03,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.75 | bwd: 366.07 | bwd_inner: 1.73 | bwd_allreduce: 364.16 | step: 3.31
8%|▊ | 268/3507 [08:17<1:06:07, 1.22s/it] {'loss': 1.0308, 'learning_rate': 1.9888242693523437e-05, 'epoch': 0.08}
[h264 @ 0xdbd3480] mmco: unref short failure
tensor([[-1.3750, -1.0625, 0.9492, 1.2656, -1.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.2500, -1.9453, 0.2041, 0.7500, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.6875, -2.2344, 0.5352, 0.2070, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:53:03,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.34 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-2.8438, -2.5469, -0.4668, 0.1904, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.1875, -1.7734, 0.5625, 0.5000, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.3750, -2.8906, -0.2354, -0.3145, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.1406, -1.7812, 0.6523, 0.6875, -1.9766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6562, -2.3594, -0.3555, 0.2578, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:53:05,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.33 | optimizer_step: 0.36
[2025-11-06 17:53:05,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.95 | bwd_microstep: 1724.47 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1723.34 | step_microstep: 2.99
[2025-11-06 17:53:05,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 503.28 | bwd: 1725.44 | bwd_inner: 1.87 | bwd_allreduce: 1723.40 | step: 3.07
8%|▊ | 269/3507 [08:19<1:27:41, 1.63s/it] {'loss': 0.6548, 'learning_rate': 1.9886861330874777e-05, 'epoch': 0.08}
tensor([[-4.0625, -3.6250, -0.6836, -0.4883, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:53:06,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.72 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[5.0938, 5.2500, 5.8125, 5.2500, 4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.2969, -2.8750, -0.3594, -0.0327, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-2.4375, -2.1250, 0.0562, 0.5078, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0938, -2.7344, -0.3633, 0.1436, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.7031, -1.3750, 0.6484, 0.9648, -1.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.8750, -2.4531, 0.1089, 0.1396, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.7969, -1.4062, 0.7500, 0.5859, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 17:53:06,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 17:53:06,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.54 | bwd_microstep: 66.25 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 65.16 | step_microstep: 1.70
[2025-11-06 17:53:06,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 282.28 | bwd: 67.19 | bwd_inner: 1.88 | bwd_allreduce: 65.19 | step: 1.77
8%|▊ | 270/3507 [08:20<1:07:32, 1.25s/it] {'loss': 1.1724, 'learning_rate': 1.9885471532064456e-05, 'epoch': 0.08}
tensor([[-2.1875, -1.8984, 0.1089, 0.6836, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:53:06,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.65 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-1.9688, -1.5469, 0.7656, 0.4375, -1.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.7109, -1.3516, 1.0078, 0.9375, -1.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.8828, -1.4688, 0.8789, 0.8750, -1.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3750, -3.9531, -1.2109, -0.7188, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.3906, -2.0938, 0.0498, 0.6602, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.9062, -1.6016, 0.4844, 1.1797, -1.7578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.4316, -0.0488, 1.8828, 1.4141, -0.3418]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:53:08,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 17:53:08,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.68 | bwd_microstep: 1789.64 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1788.62 | step_microstep: 1.94
[2025-11-06 17:53:08,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 295.35 | bwd: 1790.51 | bwd_inner: 1.73 | bwd_allreduce: 1788.66 | step: 2.01
8%|▊ | 271/3507 [08:22<1:21:35, 1.51s/it] {'loss': 0.707, 'learning_rate': 1.988407329827834e-05, 'epoch': 0.08}
tensor([[-0.9727, -0.6562, 1.0000, 0.7188, -0.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.3203, -1.0469, 0.8203, 1.3906, -1.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:08,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.81 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-3.0781, -2.7188, -0.4062, -0.0791, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.9062, -1.3984, 1.2188, 0.4668, -1.6953]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-3.2344, -2.8125, -0.1670, -0.0239, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.1621, 0.0659, 1.5234, 1.8984, -0.1377]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.0625, -2.7188, -0.4531, 0.1895, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.7891, -1.4453, 0.7930, 1.1172, -1.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 17:53:08,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.13 | optimizer_step: 0.17
[2025-11-06 17:53:08,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.03 | bwd_microstep: 81.78 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 80.79 | step_microstep: 1.39
[2025-11-06 17:53:08,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.87 | bwd: 82.80 | bwd_inner: 1.84 | bwd_allreduce: 80.83 | step: 1.48
8%|▊ | 272/3507 [08:22<1:05:00, 1.21s/it] {'loss': 1.0103, 'learning_rate': 1.988266663070951e-05, 'epoch': 0.08}
tensor([[-3.2500, -2.7656, -0.1650, -0.2500, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2656, -2.9688, -0.8086, 0.1426, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.8828, -1.5703, 0.6250, 1.1953, -1.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:09,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.94 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-2.9531, -2.5938, -0.2402, 0.2490, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.8281, -2.5156, -0.4062, 0.4551, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.7500, -2.3438, 0.0422, -0.0747, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.4531, -1.1953, 0.6445, 1.3516, -1.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.3281, -1.9922, 0.2754, 0.6914, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:53:11,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 17:53:11,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.76 | bwd_microstep: 2332.28 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 2331.07 | step_microstep: 1.89
[2025-11-06 17:53:11,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.73 | bwd: 2333.31 | bwd_inner: 2.07 | bwd_allreduce: 2331.11 | step: 1.98
8%|▊ | 273/3507 [08:25<1:29:27, 1.66s/it] {'loss': 0.5742, 'learning_rate': 1.9881251530558224e-05, 'epoch': 0.08}
tensor([[-2.3906, -1.9219, 0.8203, 0.4531, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-0.4238, -0.0962, 1.8438, 1.8203, -0.3633]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
[2025-11-06 17:53:11,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.87 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([2], device='cuda:3')
tensor([[-3.7969, -3.4531, -1.1562, -0.3906, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.2344, -1.8281, 0.5742, 0.5312, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.0781, -0.6289, 1.6172, 1.0156, -0.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.1875, -1.8672, 0.3066, 0.8750, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.2969, -1.9297, 0.2969, 0.4766, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4688, -3.9844, -0.8789, -0.7344, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:53:12,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:53:12,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.56 | bwd_microstep: 213.00 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 211.74 | step_microstep: 1.62
[2025-11-06 17:53:12,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.45 | bwd: 214.03 | bwd_inner: 2.14 | bwd_allreduce: 211.77 | step: 1.69
8%|▊ | 274/3507 [08:25<1:11:20, 1.32s/it] {'loss': 0.655, 'learning_rate': 1.9879827999031952e-05, 'epoch': 0.08}
tensor([[-3.5625, -3.2500, -1.0000, 0.0500, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:12,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.00 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.5156, -2.1875, -0.0154, 0.5078, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.7031, -2.2656, 0.3691, 0.3945, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.5938, -2.1875, 0.1875, 0.3105, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.6562, -1.1875, 1.0234, 0.4668, -1.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-1.8047, -1.3438, 1.1406, 0.5820, -1.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.0156, -1.6016, 0.8008, 0.4141, -1.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5312, -4.0938, -1.3438, -0.8516, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:53:14,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 17:53:14,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.72 | bwd_microstep: 1848.86 | bwd_inner_microstep: 1.32 | bwd_allreduce_microstep: 1847.46 | step_microstep: 2.83
[2025-11-06 17:53:14,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.74 | bwd: 1849.84 | bwd_inner: 2.22 | bwd_allreduce: 1847.50 | step: 2.92
8%|▊ | 275/3507 [08:28<1:25:46, 1.59s/it] {'loss': 0.96, 'learning_rate': 1.9878396037345342e-05, 'epoch': 0.08}
tensor([[-2.4375, -1.9297, 0.6797, 0.1230, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.5859, -1.3203, 0.4941, 1.1328, -1.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.6562, -1.3594, 0.6133, 1.2344, -1.5391]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:14,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.57 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.1875, -1.8203, 0.4941, 0.6016, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.7812, -1.3125, 0.9961, 0.4570, -1.5859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.3750, -1.8984, 0.6602, 0.3652, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.8750, -2.5938, -0.5586, 0.3945, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.6094, -1.2891, 0.7500, 1.2109, -1.4922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:14,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 17:53:14,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.97 | bwd_microstep: 1.53 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.67 | step_microstep: 2.78
[2025-11-06 17:53:14,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.57 | bwd: 2.37 | bwd_inner: 1.50 | bwd_allreduce: 0.72 | step: 2.87
8%|▊ | 276/3507 [08:28<1:06:55, 1.24s/it] {'loss': 0.5879, 'learning_rate': 1.9876955646720253e-05, 'epoch': 0.08}
tensor([[-2.1094, -1.7812, 0.2354, 0.6211, -1.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:14,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.66 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-0.9805, -0.7422, 0.8398, 1.4844, -0.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.9453, -1.5703, 0.6602, 0.8164, -1.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.2500, -1.8984, 0.3379, 0.6875, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.3867, 0.0083, 2.0781, 1.6484, -0.3086]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.8125, -2.3281, 0.4453, 0.0189, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.6094, -2.1562, 0.6367, 0.4941, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.3438, -1.8516, 0.8359, 0.1963, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:53:17,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.35 | optimizer_step: 0.45
[2025-11-06 17:53:17,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.57 | bwd_microstep: 2772.24 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 2771.34 | step_microstep: 3.34
[2025-11-06 17:53:17,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.26 | bwd: 2773.16 | bwd_inner: 1.56 | bwd_allreduce: 2771.41 | step: 3.43
8%|▊ | 277/3507 [08:31<1:38:58, 1.84s/it] {'loss': 0.7002, 'learning_rate': 1.9875506828385723e-05, 'epoch': 0.08}
tensor([[-2.1719, -1.6875, 0.8438, 0.3379, -1.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.8984, -1.5625, 0.5469, 0.8594, -1.7578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2812, -3.9062, -1.4688, -0.4863, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:18,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.56 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-1.0859, -0.7891, 1.0938, 1.3359, -1.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.0000, -1.5000, 1.2344, 0.6836, -1.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.4531, -2.0156, 0.2344, 0.0170, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.5938, -3.0312, -0.0972, -0.4473, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.6875, -3.1719, -0.0045, -0.1167, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 17:53:18,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 17:53:18,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.03 | bwd_microstep: 36.22 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 35.08 | step_microstep: 1.63
[2025-11-06 17:53:18,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 425.62 | bwd: 37.01 | bwd_inner: 1.78 | bwd_allreduce: 35.11 | step: 1.70
8%|▊ | 278/3507 [08:32<1:17:19, 1.44s/it] {'loss': 0.937, 'learning_rate': 1.9874049583577983e-05, 'epoch': 0.08}
tensor([[-3.3438, -3.0312, -0.9453, 0.0649, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.2188, -1.8984, 0.1699, 0.8711, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:18,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.86 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-2.3906, -2.0156, 0.2061, 0.3184, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-2.2344, -1.8906, 0.4824, 1.0938, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([[-1.9062, -1.4375, 0.9766, 0.3828, -1.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([2], device='cuda:3')
tensor([[-1.3516, -1.0234, 1.0078, 1.2734, -1.2422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.0000, -2.5156, 0.3340, 0.2500, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.9297, -1.5000, 0.9180, 0.6250, -1.7578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:53:19,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.16 | optimizer_step: 0.22
[2025-11-06 17:53:19,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.01 | bwd_microstep: 1107.57 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1106.42 | step_microstep: 2.10
[2025-11-06 17:53:19,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.89 | bwd: 1108.30 | bwd_inner: 1.73 | bwd_allreduce: 1106.46 | step: 2.17
8%|▊ | 279/3507 [08:33<1:18:14, 1.45s/it] {'loss': 0.9072, 'learning_rate': 1.987258391354046e-05, 'epoch': 0.08}
tensor([[-2.7344, -2.3750, -0.0640, 0.7500, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.7031, -2.3906, -0.3945, 0.5234, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:20,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.66 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-1.2891, -0.9922, 0.8281, 1.0703, -1.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.7422, -1.2656, 1.1641, 0.6641, -1.5547]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7188, -2.3281, 0.1074, 0.5273, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.3438, -2.0312, 0.0317, 0.7070, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.5781, -2.2812, -0.2461, 0.6367, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.4531, -2.9844, -0.0776, -0.0364, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 17:53:20,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 17:53:20,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.23 | bwd_microstep: 183.71 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 182.69 | step_microstep: 1.85
[2025-11-06 17:53:20,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.92 | bwd: 184.53 | bwd_inner: 1.68 | bwd_allreduce: 182.72 | step: 1.92
8%|▊ | 280/3507 [08:34<1:03:51, 1.19s/it] {'loss': 0.5625, 'learning_rate': 1.9871109819523765e-05, 'epoch': 0.08}
tensor([[-4.9688, -4.5000, -1.5859, -0.7305, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.1875, -2.7969, -0.3281, 0.3633, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.4844, -3.0000, -0.0359, 0.2158, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:20,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.54 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11
tensor([[-2.0781, -1.7734, 0.0952, 0.6914, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.9688, -2.6250, -0.4199, 0.3594, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.8359, -1.3594, 0.9453, 0.5547, -1.6484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-1.9062, -1.6016, 0.3418, 1.0625, -1.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7188, -4.2188, -1.2734, -1.1484, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:53:23,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 17:53:23,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.23 | bwd_microstep: 2176.48 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 2175.31 | step_microstep: 1.56
[2025-11-06 17:53:23,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.80 | bwd: 2177.44 | bwd_inner: 1.92 | bwd_allreduce: 2175.36 | step: 1.67
8%|▊ | 281/3507 [08:36<1:25:50, 1.60s/it] {'loss': 0.9258, 'learning_rate': 1.98696273027857e-05, 'epoch': 0.08}
tensor([[-1.7500, -1.2734, 1.1016, 0.7695, -1.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8125, -3.4375, -1.0469, -0.1206, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.8281, -3.4688, -1.0469, -0.0189, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:23,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.92 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.0469, -1.6719, 0.6992, 0.8086, -1.8828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0312, -2.4219, 0.5391, -0.0923, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.2969, -2.7969, 0.1221, 0.0452, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.5547, -1.1562, 1.1562, 1.3281, -1.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.7188, -2.3906, -0.2793, 0.5352, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:23,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 17:53:23,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 200.97 | bwd_microstep: 1.76 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.70 | step_microstep: 1.51
[2025-11-06 17:53:23,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.91 | bwd: 2.51 | bwd_inner: 1.67 | bwd_allreduce: 0.73 | step: 1.59
8%|▊ | 282/3507 [08:37<1:06:49, 1.24s/it] {'loss': 0.9629, 'learning_rate': 1.9868136364591243e-05, 'epoch': 0.08}
tensor([[-1.7734, -1.3828, 0.8086, 0.9766, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0938, -3.5938, -0.6289, -0.2207, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:53:23,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.58 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-2.3438, -1.9219, 0.4980, 0.3887, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.2500, -0.9297, 0.8828, 1.2500, -1.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.2812, -1.9141, 0.3652, 0.9883, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.7500, -1.2969, 1.0391, 0.5039, -1.5703]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.0781, -2.6250, 0.1934, 0.2754, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[3.2344, 3.3906, 4.1562, 4.0938, 3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:53:26,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:53:26,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.37 | bwd_microstep: 2606.47 | bwd_inner_microstep: 0.93 |
bwd_allreduce_microstep: 2605.44 | step_microstep: 1.69 [2025-11-06 17:53:26,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 439.99 | bwd: 2607.38 | bwd_inner: 1.73 | bwd_allreduce: 2605.50 | step: 1.79 8%|▊ | 283/3507 [08:40<1:36:31, 1.80s/it] {'loss': 0.7754, 'learning_rate': 1.9866637006212582e-05, 'epoch': 0.08} 8%|▊ | 283/3507 [08:40<1:36:31, 1.80s/it]tensor([[-2.5312, -2.2031, -0.1377, 0.7148, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:53:26,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.70 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.1172, -0.8555, 0.8477, 1.5078, -1.0391]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0469, -2.5781, 0.1279, 0.2139, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2969, -1.8281, 0.7461, 0.4102, -2.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6562, -2.3281, -0.2119, 0.5820, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.0000, -0.7383, 0.9180, 1.7109, -0.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.4062, -5.9375, -3.0781, -1.6016, -5.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.6797, -1.2031, 1.3750, 1.1250, -1.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:53:27,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.13 | optimizer_step: 0.17 [2025-11-06 17:53:27,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.28 | bwd_microstep: 1.72 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.71 | 
step_microstep: 1.63 [2025-11-06 17:53:27,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 453.01 | bwd: 2.73 | bwd_inner: 1.86 | bwd_allreduce: 0.74 | step: 1.70 8%|▊ | 284/3507 [08:40<1:15:31, 1.41s/it] {'loss': 0.5441, 'learning_rate': 1.986512922892906e-05, 'epoch': 0.08} 8%|▊ | 284/3507 [08:40<1:15:31, 1.41s/it]tensor([[-2.5781, -2.2500, -0.2754, 0.6289, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.8438, -1.3516, 1.0312, 0.5898, -1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.4531, -1.9844, 0.6953, 0.6523, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:53:27,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.29 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.0156, -1.7031, 0.1641, 1.0391, -1.8672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -3.7031, -1.3438, -0.2969, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -3.8594, -1.1406, -0.5195, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.6406, -2.2031, 0.3320, 0.7070, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4531, -2.8750, -0.1660, -0.3926, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 17:53:29,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.25 [2025-11-06 17:53:29,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.98 | bwd_microstep: 1570.67 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 1569.65 | step_microstep: 2.01 [2025-11-06 17:53:29,152] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 437.30 | bwd: 1571.57 | bwd_inner: 1.74 | bwd_allreduce: 1569.70 | step: 2.10 8%|▊ | 285/3507 [08:42<1:25:50, 1.60s/it] {'loss': 0.8921, 'learning_rate': 1.9863613034027224e-05, 'epoch': 0.08} 8%|▊ | 285/3507 [08:42<1:25:50, 1.60s/it]tensor([[-2.4062, -1.8906, 0.8086, 0.4102, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:53:29,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.90 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.5625, -3.0469, -0.2354, -0.1387, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.1562, -1.7578, 0.5352, 1.0703, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2812, -1.8906, 0.2109, 0.7148, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5000, -2.1562, 0.0098, 0.9180, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3750, -3.9688, -1.4141, -0.2021, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.0938, -2.7188, -0.4082, 0.2109, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.6211, -0.2754, 1.7266, 1.8906, -0.5664]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:53:29,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.25 | optimizer_step: 0.21 [2025-11-06 17:53:29,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.03 | bwd_microstep: 59.69 | bwd_inner_microstep: 1.40 | bwd_allreduce_microstep: 58.18 | step_microstep: 2.49 [2025-11-06 17:53:29,542] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 296.96 | bwd: 60.57 | bwd_inner: 2.20 | bwd_allreduce: 58.21 | step: 2.57 8%|▊ | 286/3507 [08:43<1:06:23, 1.24s/it] {'loss': 0.5676, 'learning_rate': 1.98620884228008e-05, 'epoch': 0.08} 8%|▊ | 286/3507 [08:43<1:06:23, 1.24s/it]tensor([[-1.8828, -1.5312, 0.5000, 1.0938, -1.7422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:53:29,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.94 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-4.0625, -3.5156, -0.7695, -0.5547, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([[-2.5156, -2.0625, 0.4609, 0.5664, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([3], device='cuda:1') tensor([[-2.4219, -2.0938, -0.1016, 0.6562, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3750, -2.9688, -0.6328, 0.0928, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.2188, -1.7969, 0.6211, 0.8320, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5781, -3.0938, -0.5742, -0.4570, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2500, -1.7734, 0.7422, 0.6836, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:53:30,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 17:53:30,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.51 | bwd_microstep: 400.78 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 399.51 | step_microstep: 62.25 [2025-11-06 17:53:30,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 373.48 | bwd: 401.77 | bwd_inner: 
2.03 | bwd_allreduce: 399.57 | step: 62.36 8%|▊ | 287/3507 [08:44<1:00:33, 1.13s/it] {'loss': 0.7085, 'learning_rate': 1.9860555396550693e-05, 'epoch': 0.08} 8%|▊ | 287/3507 [08:44<1:00:33, 1.13s/it]tensor([[-4.0625, -3.6875, -1.3984, -0.1934, -3.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5000, -2.0781, 0.1895, 0.8164, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6719, -3.2969, -1.1094, -0.1816, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:53:30,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.35 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06 tensor([[-3.1562, -2.5781, 0.2207, -0.1084, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.6172, -1.2188, 1.0312, 1.3047, -1.4922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4375, -2.0625, 0.0505, 0.5117, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5000, -2.0469, 0.4141, 0.6875, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8594, -2.5000, -0.3730, 0.4492, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:53:30,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 17:53:30,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 67.61 | bwd_microstep: 226.26 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 225.26 | step_microstep: 1.92 [2025-11-06 17:53:30,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 190.98 | bwd: 226.94 | bwd_inner: 1.52 | bwd_allreduce: 225.30 | step: 1.98 
8%|▊ | 288/3507 [08:44<49:31, 1.08it/s] {'loss': 0.5984, 'learning_rate': 1.985901395658498e-05, 'epoch': 0.08} 8%|▊ | 288/3507 [08:44<49:31, 1.08it/s]tensor([[-2.3281, -1.9609, 0.0354, 0.7969, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2344, -1.8828, 0.0055, 0.7891, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-1.3125, -0.8203, 1.5703, 1.0000, -1.1641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3438, -1.8516, 0.5078, 0.3848, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.7109, -1.2969, 0.9531, 1.0000, -1.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:53:31,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.41 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.8438, -1.5078, 0.4395, 1.1641, -1.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [h264 @ 0xd409a40] mmco: unref short failure [h264 @ 0xd409a40] mmco: unref short failure tensor([[-2.3750, -1.8438, 0.9023, 0.5586, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.2422, -0.8672, 1.2188, 1.5234, -1.1484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:53:33,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 17:53:33,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.82 | bwd_microstep: 1853.24 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1852.19 | step_microstep: 2.40 [2025-11-06 17:53:33,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.26 | bwd: 1854.12 | bwd_inner: 1.77 
| bwd_allreduce: 1852.23 | step: 2.48 8%|▊ | 289/3507 [08:47<1:20:04, 1.49s/it] {'loss': 1.0095, 'learning_rate': 1.9857464104218933e-05, 'epoch': 0.08} 8%|▊ | 289/3507 [08:47<1:20:04, 1.49s/it]tensor([[-3.0000, -2.5469, -0.1475, 0.1826, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.5391, -1.1250, 1.0234, 1.1562, -1.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4219, -2.8594, 0.2012, 0.1191, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:53:33,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.52 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.8750, -3.3125, -0.3105, -0.0879, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.0625, -1.7578, 0.1299, 0.9805, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4844, -2.8750, -0.0576, 0.0410, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6094, -2.0625, 0.6992, 0.2812, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.8594, -2.4375, -0.1235, 0.5156, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:53:34,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:53:34,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.61 | bwd_microstep: 129.58 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 128.34 | step_microstep: 1.78 [2025-11-06 17:53:34,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 540.16 | bwd: 130.43 | bwd_inner: 1.93 | bwd_allreduce: 128.37 | step: 1.85 8%|▊ | 
290/3507 [08:48<1:07:31, 1.26s/it] {'loss': 0.7224, 'learning_rate': 1.9855905840774994e-05, 'epoch': 0.08} 8%|▊ | 290/3507 [08:48<1:07:31, 1.26s/it]tensor([[-1.9844, -1.3906, 1.2188, 0.6328, -1.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0938, -1.6406, 0.5586, 0.1611, -1.8984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:53:34,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 120.09 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-1.7188, -1.3828, 0.5859, 1.3516, -1.6016]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1250, -1.6016, 0.7930, 0.5625, -1.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.4062, -1.9688, 0.4316, 0.9727, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3438, -4.8125, -1.9844, -0.9922, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5156, -1.9297, 0.7422, 0.2246, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4844, -2.0469, 0.3555, 0.7773, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:53:36,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.68 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 17:53:36,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.06 | bwd_microstep: 2037.29 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 2035.99 | step_microstep: 2.37 [2025-11-06 17:53:36,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.18 | bwd: 2038.27 | bwd_inner: 1.98 | bwd_allreduce: 2036.13 | step: 2.45 8%|▊ | 291/3507 [08:50<1:25:21, 1.59s/it] {'loss': 
0.6213, 'learning_rate': 1.985433916758278e-05, 'epoch': 0.08} 8%|▊ | 291/3507 [08:50<1:25:21, 1.59s/it]tensor([[-2.2344, -1.8984, 0.0039, 0.8086, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1406, -2.7188, -0.2949, 0.2354, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:53:36,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.22 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.6406, -3.0625, -0.1689, -0.2891, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3594, -1.8047, 0.8594, 0.2275, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6250, -3.1406, -0.6367, -0.3438, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6562, -2.3125, -0.2539, 0.7383, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0938, -4.5625, -1.6484, -0.9414, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6719, -2.2656, -0.0220, 0.5898, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:53:37,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:53:37,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.85 | bwd_microstep: 96.63 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 95.64 | step_microstep: 1.86 [2025-11-06 17:53:37,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.10 | bwd: 97.41 | bwd_inner: 1.62 | bwd_allreduce: 95.67 | step: 1.93 8%|▊ | 292/3507 [08:51<1:07:48, 1.27s/it] {'loss': 0.6313, 'learning_rate': 
1.9852764085979088e-05, 'epoch': 0.08} 8%|▊ | 292/3507 [08:51<1:07:48, 1.27s/it]tensor([[-2.6562, -2.2812, -0.1055, 0.7930, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.3594, -1.9844, 0.0165, 0.5742, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:53:37,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.94 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.5312, -2.0625, 0.3516, 0.5820, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.4375, -1.9453, 0.6875, 0.7188, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4219, -2.8125, -0.0732, -0.1963, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.0625, -2.5156, 0.1289, -0.2500, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.5156, -2.0156, 0.7070, 0.7227, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4062, -2.9844, -0.5547, 0.2910, -3.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:53:39,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 17:53:39,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.76 | bwd_microstep: 2156.33 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 2155.24 | step_microstep: 2.10 [2025-11-06 17:53:39,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.73 | bwd: 2157.13 | bwd_inner: 1.72 | bwd_allreduce: 2155.28 | step: 2.18 8%|▊ | 293/3507 [08:53<1:28:19, 1.65s/it] {'loss': 0.9939, 'learning_rate': 1.9851180597307884e-05, 'epoch': 0.08} 
8%|▊ | 293/3507 [08:53<1:28:19, 1.65s/it]tensor([[-3.0312, -2.4688, 0.4980, 0.4902, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:53:39,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.80 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.5312, -3.1406, -1.0312, 0.1069, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7500, -2.3281, 0.0063, 0.5820, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.6250, -2.1094, 0.3809, 0.2852, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4375, -4.8750, -1.8750, -0.9492, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4375, -3.9688, -1.3594, -0.3359, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9219, -2.4688, -0.0723, 0.3848, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3281, -1.8750, 0.5156, 0.4590, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:53:40,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 17:53:40,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.37 | bwd_microstep: 105.10 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 103.88 | step_microstep: 1.66 [2025-11-06 17:53:40,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.19 | bwd: 106.02 | bwd_inner: 1.98 | bwd_allreduce: 103.92 | step: 1.73 8%|▊ | 294/3507 [08:54<1:09:21, 1.30s/it] {'loss': 0.6831, 'learning_rate': 1.9849588702920318e-05, 'epoch': 0.08} 8%|▊ | 294/3507 [08:54<1:09:21, 
1.30s/it]tensor([[-2.6562, -2.2812, -0.1992, 0.7656, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:53:40,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.91 | bwd_microstep: 1.12 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.4531, -1.9844, 0.4551, 0.4941, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7812, -2.3750, -0.1895, 0.5820, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.0938, -1.6562, 0.7695, 1.0859, -1.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.4688, -2.1094, -0.2217, 0.6133, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6406, -2.0625, 0.5977, 0.2295, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5938, -5.9062, -2.4844, -2.0938, -6.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8438, -3.3125, -0.3770, 0.0544, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:53:42,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 17:53:42,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 80.66 | bwd_microstep: 1534.79 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 1533.47 | step_microstep: 2.23 [2025-11-06 17:53:42,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 244.59 | bwd: 1535.91 | bwd_inner: 2.27 | bwd_allreduce: 1533.50 | step: 2.32 8%|▊ | 295/3507 [08:55<1:17:36, 1.45s/it] {'loss': 0.6111, 'learning_rate': 1.98479884041747e-05, 'epoch': 0.08} 8%|▊ | 295/3507 [08:55<1:17:36, 1.45s/it]tensor([[-1.7656, -1.3438, 0.8711, 
1.0078, -1.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:53:42,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 78.96 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.3984, -0.9922, 0.7695, 0.5000, -1.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.4062, -1.9062, 0.5234, 0.4023, -2.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.8125, -2.4062, -0.1279, 0.6641, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.8750, -1.3203, 1.3906, 0.9336, -1.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3281, -1.8203, 0.7734, 0.5820, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.8203, -1.3750, 0.8828, 1.0547, -1.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.8203, -1.2031, 1.4062, 0.5000, -1.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 17:53:42,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.16 | optimizer_step: 0.21 [2025-11-06 17:53:42,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.15 | bwd_microstep: 230.47 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 229.30 | step_microstep: 1.87 [2025-11-06 17:53:42,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 280.13 | bwd: 231.40 | bwd_inner: 1.93 | bwd_allreduce: 229.34 | step: 1.95 8%|▊ | 296/3507 [08:56<1:03:01, 1.18s/it] {'loss': 0.9878, 'learning_rate': 1.9846379702436518e-05, 'epoch': 0.08} 8%|▊ | 296/3507 [08:56<1:03:01, 1.18s/it]tensor([[-1.8906, -1.4219, 0.7969, 0.5469, -1.7109]], device='cuda:1', 
[Interleaved stdout from 4 ranks elided for readability: per-rank debug prints of a 5-way bfloat16 logit tensor and its integer label (devices cuda:0–cuda:3), plus DeepSpeed [Rank 0] per-microstep timing lines (fwd_microstep roughly 130–540 ms; bwd dominated by bwd_allreduce, usually under 1 s but spiking to ~2.5 s at step 312).]

  8%|▊ | 297/3507 [08:57<1:05:24, 1.22s/it] {'loss': 0.9211, 'learning_rate': 1.9844762599078427e-05, 'epoch': 0.08}
  8%|▊ | 298/3507 [08:58<54:41, 1.02s/it] {'loss': 1.2556, 'learning_rate': 1.9843137095480262e-05, 'epoch': 0.08}
  9%|▊ | 299/3507 [09:00<1:18:01, 1.46s/it] {'loss': 0.9709, 'learning_rate': 1.9841503193029005e-05, 'epoch': 0.09}
  9%|▊ | 300/3507 [09:01<1:03:49, 1.19s/it] {'loss': 0.6733, 'learning_rate': 1.9839860893118824e-05, 'epoch': 0.09}
  9%|▊ | 301/3507 [09:03<1:17:55, 1.46s/it] {'loss': 1.0337, 'learning_rate': 1.983821019715104e-05, 'epoch': 0.09}
  9%|▊ | 302/3507 [09:03<1:01:43, 1.16s/it] {'loss': 0.6226, 'learning_rate': 1.9836551106534138e-05, 'epoch': 0.09}
  9%|▊ | 303/3507 [09:05<1:10:52, 1.33s/it] {'loss': 0.6045, 'learning_rate': 1.9834883622683775e-05, 'epoch': 0.09}
  9%|▊ | 304/3507 [09:06<57:25, 1.08s/it] {'loss': 0.6787, 'learning_rate': 1.9833207747022772e-05, 'epoch': 0.09}
  9%|▊ | 305/3507 [09:07<1:07:40, 1.27s/it] {'loss': 0.6987, 'learning_rate': 1.983152348098109e-05, 'epoch': 0.09}
  9%|▊ | 306/3507 [09:09<1:07:59, 1.27s/it] {'loss': 0.7954, 'learning_rate': 1.9829830825995874e-05, 'epoch': 0.09}
  9%|▉ | 307/3507 [09:10<1:04:35, 1.21s/it] {'loss': 0.6292, 'learning_rate': 1.9828129783511406e-05, 'epoch': 0.09}
  9%|▉ | 308/3507 [09:12<1:17:52, 1.46s/it] {'loss': 0.8376, 'learning_rate': 1.982642035497914e-05, 'epoch': 0.09}
  9%|▉ | 309/3507 [09:12<1:01:18, 1.15s/it] {'loss': 0.9644, 'learning_rate': 1.982470254185768e-05, 'epoch': 0.09}
  9%|▉ | 310/3507 [09:14<1:08:38, 1.29s/it] {'loss': 0.6091, 'learning_rate': 1.9822976345612784e-05, 'epoch': 0.09}
  9%|▉ | 311/3507 [09:15<1:11:13, 1.34s/it] {'loss': 0.6021, 'learning_rate': 1.982124176771736e-05, 'epoch': 0.09}
  9%|▉ | 312/3507 [09:19<1:54:51, 2.16s/it] {'loss': 0.6208, 'learning_rate': 1.9819498809651472e-05, 'epoch': 0.09}
  9%|▉ | 313/3507 [09:20<1:29:47, 1.69s/it] {'loss': 0.5876, 'learning_rate': 1.981774747290234e-05, 'epoch': 0.09}
  9%|▉ | 314/3507 [09:20<1:10:59, 1.33s/it] {'loss': 0.6047, 'learning_rate': 1.9815987758964322e-05, 'epoch': 0.09}
  9%|▉ | 315/3507 [09:21<58:25, 1.10s/it] {'loss': 0.7039, 'learning_rate': 1.981421966933893e-05, 'epoch': 0.09}
  9%|▉ | 316/3507 [09:25<1:40:26, 1.89s/it] {'loss': 0.8787, 'learning_rate': 1.981244320553482e-05, 'epoch': 0.09}
grad_fn=) tensor([3], device='cuda:2') tensor([[-2.4688, -1.9609, 0.4922, 0.7461, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.6719, -2.0156, 0.8125, 0.1807, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:54:11,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.15 | optimizer_step: 0.19 [2025-11-06 17:54:11,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 79.76 | bwd_microstep: 1.77 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.66 | step_microstep: 1.41 [2025-11-06 17:54:11,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.95 | bwd: 2.76 | bwd_inner: 1.94 | bwd_allreduce: 0.70 | step: 1.50 9%|▉ | 317/3507 [09:25<1:17:15, 1.45s/it] {'loss': 0.5354, 'learning_rate': 1.9810658369067795e-05, 'epoch': 0.09} 9%|▉ | 317/3507 [09:25<1:17:15, 1.45s/it]tensor([[-1.7031, -1.2891, 0.5703, 0.6328, -1.5078]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9688, -1.3438, 1.5312, 0.7891, -1.7266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3438, -2.6406, 0.4941, -0.0771, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:54:12,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.37 | bwd_microstep: 1.13 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.1250, -2.5469, 0.1787, 0.4902, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4531, -2.8594, -0.3359, -0.3320, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5156, -2.0781, 0.0040, 0.8516, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.3125, 
-3.7656, -0.8633, 0.0613, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1250, -3.5469, -0.7031, -0.1533, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:54:13,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 17:54:13,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.79 | bwd_microstep: 210.03 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 208.97 | step_microstep: 2.43 [2025-11-06 17:54:13,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.17 | bwd: 211.16 | bwd_inner: 2.00 | bwd_allreduce: 209.01 | step: 2.50 9%|▉ | 318/3507 [09:27<1:23:40, 1.57s/it] {'loss': 0.99, 'learning_rate': 1.9808865161460807e-05, 'epoch': 0.09} 9%|▉ | 318/3507 [09:27<1:23:40, 1.57s/it]tensor([[-6.2812, -5.5625, -2.2812, -1.6250, -5.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:54:13,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.36 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.4844, -3.0469, -0.8867, 0.0601, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1719, -2.5312, 0.4980, 0.2930, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6875, -4.0938, -1.2500, -0.5508, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.5469, -1.0859, 1.2188, 1.4766, -1.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.2031, -1.5391, 1.2578, 0.3633, -1.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8594, -2.4375, -0.2109, 0.6328, -2.6094]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5938, -4.0312, -1.2578, -0.1768, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:54:14,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.83 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 17:54:14,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.23 | bwd_microstep: 773.64 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 772.58 | step_microstep: 2.65 [2025-11-06 17:54:14,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.61 | bwd: 774.61 | bwd_inner: 1.84 | bwd_allreduce: 772.63 | step: 2.74 9%|▉ | 319/3507 [09:28<1:17:04, 1.45s/it] {'loss': 0.5916, 'learning_rate': 1.980706358424394e-05, 'epoch': 0.09} 9%|▉ | 319/3507 [09:28<1:17:04, 1.45s/it]tensor([[-2.8125, -2.1719, 0.4414, -0.0142, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.4160, -0.0532, 1.8516, 2.1250, -0.3340]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8281, -2.4062, -0.3438, 0.7031, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:54:15,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.09 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.2969, -1.8125, 0.5547, 1.1719, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9453, -1.4375, 1.1172, 1.3047, -1.7422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.4531, -2.0000, 0.2480, 0.9883, -2.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.4688, -1.8203, 1.1406, 0.6211, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:0') tensor([[-2.1406, -1.5000, 1.4062, 0.7812, -1.8828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:54:17,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.32 | optimizer_step: 0.30 [2025-11-06 17:54:17,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.83 | bwd_microstep: 1375.54 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1374.46 | step_microstep: 3.03 [2025-11-06 17:54:17,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.93 | bwd: 1376.44 | bwd_inner: 1.77 | bwd_allreduce: 1374.51 | step: 3.12 9%|▉ | 320/3507 [09:30<1:28:50, 1.67s/it] {'loss': 0.626, 'learning_rate': 1.9805253638954428e-05, 'epoch': 0.09} 9%|▉ | 320/3507 [09:30<1:28:50, 1.67s/it]tensor([[1.4062, 1.9453, 3.9062, 2.5781, 1.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4219, -2.8906, -0.2637, 0.4355, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:54:17,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 134.82 | bwd_microstep: 1.62 | bwd_inner_microstep: 1.50 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.7969, -2.2344, 0.4414, 0.8516, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3281, -2.8594, -0.4844, 0.6641, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.9609, -1.4688, 0.8438, 1.0859, -1.7422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1719, -2.4844, 0.7695, 0.4805, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2969, -2.7031, 0.2422, 0.4941, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7969, 
-2.2969, 0.1494, 0.7812, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:54:17,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.49 | optimizer_step: 0.46 [2025-11-06 17:54:17,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.14 | bwd_microstep: 226.06 | bwd_inner_microstep: 2.02 | bwd_allreduce_microstep: 223.84 | step_microstep: 4.53 [2025-11-06 17:54:17,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 279.94 | bwd: 227.68 | bwd_inner: 3.52 | bwd_allreduce: 223.91 | step: 4.61 9%|▉ | 321/3507 [09:31<1:11:02, 1.34s/it] {'loss': 0.6143, 'learning_rate': 1.9803435327136647e-05, 'epoch': 0.09} 9%|▉ | 321/3507 [09:31<1:11:02, 1.34s/it]tensor([[-3.4062, -2.9688, -0.7695, 0.5000, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-2.3906, -1.7969, 1.0000, 0.8789, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:54:17,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.79 | bwd_microstep: 2.43 | bwd_inner_microstep: 2.12 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.19 tensor([[-1.8984, -1.3906, 1.0781, 1.2266, -1.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4375, -2.9062, -0.2061, 0.5469, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7812, -3.0938, 0.1279, 0.1465, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.4375, -1.0078, 1.1016, 1.5078, -1.2734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9375, -2.2500, 0.9531, 0.4336, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3125, -2.8125, -0.3691, 0.5391, -2.9844]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') [2025-11-06 17:54:18,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.36 | optimizer_step: 0.29 [2025-11-06 17:54:18,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 206.97 | bwd_microstep: 794.89 | bwd_inner_microstep: 2.00 | bwd_allreduce_microstep: 792.67 | step_microstep: 3.24 [2025-11-06 17:54:18,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.77 | bwd: 797.31 | bwd_inner: 4.18 | bwd_allreduce: 792.78 | step: 3.43 9%|▉ | 322/3507 [09:32<1:09:22, 1.31s/it] {'loss': 1.4397, 'learning_rate': 1.9801608650342104e-05, 'epoch': 0.09} 9%|▉ | 322/3507 [09:32<1:09:22, 1.31s/it]tensor([[-1.9531, -1.3125, 1.5547, 0.9727, -1.6953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.2812, -1.7188, 0.8672, 0.6992, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5000, -3.8594, -0.7734, -0.0306, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:54:18,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.80 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-2.6875, -2.0469, 0.8477, 0.4766, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2188, -1.6953, 0.9219, 1.3516, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[2.0312, 2.5000, 4.1250, 2.8125, 1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-2.7812, -2.1250, 0.5625, 0.2656, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9219, -2.2812, 0.5938, 0.1982, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:2') [2025-11-06 17:54:19,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.14 | optimizer_step: 0.18 [2025-11-06 17:54:19,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.07 | bwd_microstep: 30.98 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 29.80 | step_microstep: 1.47 [2025-11-06 17:54:19,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.84 | bwd: 31.98 | bwd_inner: 2.00 | bwd_allreduce: 29.84 | step: 1.56 9%|▉ | 323/3507 [09:33<55:06, 1.04s/it] {'loss': 0.7947, 'learning_rate': 1.9799773610129446e-05, 'epoch': 0.09} 9%|▉ | 323/3507 [09:33<55:06, 1.04s/it]tensor([[-2.4219, -1.9688, 0.3652, 1.0156, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:54:19,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.54 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.77 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.20 tensor([[-2.2344, -1.8516, 0.1260, 1.1406, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2656, -1.5859, 1.3281, 0.5430, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9688, -2.4531, 0.0547, 0.4316, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5781, -2.9688, 0.0601, 0.3652, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.0625, -1.4375, 0.8945, 0.1069, -1.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.3750, -1.9062, 0.4160, 1.0156, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7031, -2.1250, 0.5352, 0.5000, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:54:21,259] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.55 | optimizer_step: 0.65 [2025-11-06 17:54:21,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.91 | bwd_microstep: 754.90 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 753.66 | step_microstep: 4.55 [2025-11-06 17:54:21,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.47 | bwd: 756.98 | bwd_inner: 2.88 | bwd_allreduce: 753.81 | step: 4.72 9%|▉ | 324/3507 [09:35<1:11:05, 1.34s/it] {'loss': 0.6272, 'learning_rate': 1.979793020806446e-05, 'epoch': 0.09} 9%|▉ | 324/3507 [09:35<1:11:05, 1.34s/it]tensor([[-4.0938, -3.4062, -0.1992, -0.2002, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0312, -2.4844, 0.1250, 0.5703, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0312, -0.4512, 2.0156, 1.1172, -0.8398]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-2.5781, -1.9141, 0.9180, 0.7188, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 17:54:21,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.31 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.7656, -2.0625, 0.8281, 0.0166, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2969, -1.8047, 0.5000, 0.8711, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.0000, -0.5586, 1.5547, 1.7734, -0.8516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5312, -2.9062, 0.1040, 0.2480, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:54:21,750] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.23 | optimizer_step: 0.20 [2025-11-06 17:54:21,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.82 | bwd_microstep: 2.23 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1.06 | step_microstep: 3.02 [2025-11-06 17:54:21,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 440.15 | bwd: 3.11 | bwd_inner: 1.86 | bwd_allreduce: 1.09 | step: 3.10 9%|▉ | 325/3507 [09:35<57:32, 1.09s/it] {'loss': 1.3381, 'learning_rate': 1.9796078445720065e-05, 'epoch': 0.09} 9%|▉ | 325/3507 [09:35<57:32, 1.09s/it]tensor([[-3.9531, -3.2812, -0.0569, -0.0532, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7188, -2.0938, 0.6211, 0.6602, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:54:21,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.09 | bwd_microstep: 1.60 | bwd_inner_microstep: 1.40 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15 tensor([[-4.2812, -3.4688, 0.0796, -0.4785, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0000, -2.4219, 0.1602, 0.0449, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.1562, -1.7109, 0.5898, 1.4297, -1.9297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8594, -3.3906, -1.0156, 0.0728, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.0156, -2.5312, -0.1030, 0.5742, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.4375, -1.9922, 0.2617, 1.0703, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:54:25,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 
0.16 | optimizer_step: 0.18 [2025-11-06 17:54:25,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.55 | bwd_microstep: 1612.06 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1610.94 | step_microstep: 1.61 [2025-11-06 17:54:25,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.62 | bwd: 1613.66 | bwd_inner: 2.45 | bwd_allreduce: 1611.01 | step: 1.76 9%|▉ | 326/3507 [09:39<1:35:45, 1.81s/it] {'loss': 0.6399, 'learning_rate': 1.9794218324676314e-05, 'epoch': 0.09} 9%|▉ | 326/3507 [09:39<1:35:45, 1.81s/it]tensor([[-3.3125, -2.8281, -0.4609, 0.2930, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6875, -2.3125, -0.4727, 0.6523, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7031, -3.0312, -0.1328, -0.0786, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:54:25,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.88 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-1.8281, -1.2109, 1.5078, 1.0078, -1.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.9688, -2.3438, 0.3652, 0.0337, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.7109, -1.1328, 1.4609, 1.1094, -1.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1406, -2.6250, -0.2314, 0.4805, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4531, -2.7969, 0.3809, 0.2051, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:54:25,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 
17:54:25,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.05 | bwd_microstep: 1.91 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.80 | step_microstep: 1.61 [2025-11-06 17:54:25,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 500.95 | bwd: 2.64 | bwd_inner: 1.70 | bwd_allreduce: 0.82 | step: 1.67 9%|▉ | 327/3507 [09:39<1:15:39, 1.43s/it] {'loss': 0.5701, 'learning_rate': 1.9792349846520395e-05, 'epoch': 0.09} 9%|▉ | 327/3507 [09:39<1:15:39, 1.43s/it]tensor([[-1.0078, -0.5977, 1.2891, 1.6172, -0.8516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:54:25,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.45 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.0312, -1.6016, 0.4824, 1.3281, -1.8047]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3750, -1.6797, 1.1562, 0.3145, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.3750, -1.8516, 0.4941, 0.6523, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6250, -3.2031, -1.0547, 0.1562, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5000, -2.0312, 0.2969, 1.1719, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6562, -2.1875, 0.0708, 0.8438, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1719, -1.7969, 0.1021, 1.1016, -1.9453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:54:28,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 17:54:28,132] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 154.14 | bwd_microstep: 2038.02 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 2036.90 | step_microstep: 1.68 [2025-11-06 17:54:28,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 277.63 | bwd: 2038.91 | bwd_inner: 1.79 | bwd_allreduce: 2036.97 | step: 1.76 9%|▉ | 328/3507 [09:41<1:30:13, 1.70s/it] {'loss': 0.6172, 'learning_rate': 1.979047301284662e-05, 'epoch': 0.09} 9%|▉ | 328/3507 [09:41<1:30:13, 1.70s/it]tensor([[-1.4453, -0.7852, 1.8750, 0.8359, -1.2109]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([[-2.3125, -1.5391, 1.2656, 0.1621, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([2], device='cuda:0') [2025-11-06 17:54:28,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.08 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.5156, -2.0625, 0.1289, 0.8633, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5000, -2.8438, 0.3828, 0.2266, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.6406, -0.9883, 1.6250, 0.6172, -1.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.4609, -1.1328, 0.5039, 1.1172, -1.2891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5000, -2.0938, -0.0850, 0.9844, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5000, -3.8281, -0.5742, -0.2715, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:54:28,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 17:54:28,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.05 | 
bwd_microstep: 159.09 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 158.10 | step_microstep: 1.34 [2025-11-06 17:54:28,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.15 | bwd: 159.91 | bwd_inner: 1.65 | bwd_allreduce: 158.14 | step: 1.42 9%|▉ | 329/3507 [09:42<1:10:44, 1.34s/it] {'loss': 0.5269, 'learning_rate': 1.978858782525644e-05, 'epoch': 0.09} 9%|▉ | 329/3507 [09:42<1:10:44, 1.34s/it]tensor([[-2.8750, -2.1719, 0.5547, -0.0273, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:54:28,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.57 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.8906, -2.2188, 0.6445, 0.7148, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[0.1147, 0.5703, 2.5156, 2.3438, 0.1914]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.4688, -1.8438, 1.0078, 0.7188, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2812, -2.7812, -0.3359, 0.6797, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.1719, -1.6484, 0.8828, 1.2656, -1.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5781, -2.7969, 0.2793, -0.3730, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5781, -2.9219, 0.2715, 0.3438, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:54:29,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.16 | optimizer_step: 0.15 [2025-11-06 17:54:29,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.49 | bwd_microstep: 608.39 | bwd_inner_microstep: 0.88 | 
bwd_allreduce_microstep: 607.42 | step_microstep: 1.60 [2025-11-06 17:54:29,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.08 | bwd: 609.34 | bwd_inner: 1.76 | bwd_allreduce: 607.45 | step: 1.67 9%|▉ | 330/3507 [09:43<1:04:48, 1.22s/it] {'loss': 0.6667, 'learning_rate': 1.9786694285358422e-05, 'epoch': 0.09} 9%|▉ | 330/3507 [09:43<1:04:48, 1.22s/it]tensor([[-4.0312, -3.5625, -1.1250, -0.0547, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0000, -1.2969, 1.4531, 0.4902, -1.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7500, -2.1562, 0.4668, 0.4590, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:54:29,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.58 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-2.9688, -2.4062, 0.2578, 0.5898, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7031, -2.0469, 0.8828, 0.8398, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.1875, -1.7500, 0.4590, 1.4609, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7969, -2.3281, 0.1060, 0.8477, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.8594, -1.3516, 0.8555, 1.0000, -1.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:54:29,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 17:54:29,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.69 | bwd_microstep: 1.87 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.71 | step_microstep: 
1.69
[2025-11-06 17:54:29,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.31 | bwd: 2.93 | bwd_inner: 2.06 | bwd_allreduce: 0.75 | step: 1.78
9%|▉ | 331/3507 [09:43<51:07, 1.04it/s] {'loss': 0.5627, 'learning_rate': 1.978479239476827e-05, 'epoch': 0.09}
tensor([[-1.9922, -1.3750, 1.0703, 0.4648, -1.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-1.2891, -0.7656, 1.5859, 1.3906, -1.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:30,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.05 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-1.9922, -1.3828, 1.1250, 0.4805, -1.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8906, -2.1250, 0.9570, 0.1123, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.7344, -2.2031, 0.4121, 0.8945, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.3750, -0.8125, 1.7109, 1.7656, -1.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.2656, -1.6172, 1.1719, 0.7617, -1.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.8516, -0.5000, 1.2500, 2.1250, -0.7148]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:54:32,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.19 | optimizer_step: 0.28
[2025-11-06 17:54:32,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.41 | bwd_microstep: 2664.54 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 2663.19 | step_microstep: 2.04
[2025-11-06 17:54:32,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.48 | bwd: 2665.43 | bwd_inner: 2.05 | bwd_allreduce: 2663.25 | step: 2.12
9%|▉ | 332/3507 [09:46<1:23:47, 1.58s/it] {'loss': 1.0198, 'learning_rate': 1.978288215510881e-05, 'epoch': 0.09}
tensor([[-3.2500, -2.6719, -0.0923, -0.2090, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:54:33,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.10 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.3750, -2.7500, 0.2676, 0.5508, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0000, -2.3906, 0.5391, 0.7656, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.7656, -2.1875, 0.4297, 0.6211, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.7422, -1.0469, 1.4844, 0.3320, -1.4766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.1719, -1.5703, 1.0547, 0.3730, -1.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.4375, -1.7812, 1.0000, 0.5977, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.1562, -1.5781, 0.9688, 0.5430, -1.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:54:33,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 17:54:33,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.42 | bwd_microstep: 47.09 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 46.31 | step_microstep: 1.43
[2025-11-06 17:54:33,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.54 | bwd: 47.86 | bwd_inner: 1.38 | bwd_allreduce: 46.35 | step: 1.52
9%|▉ | 333/3507 [09:47<1:05:38, 1.24s/it] {'loss': 0.6331, 'learning_rate': 1.9780963568009996e-05, 'epoch': 0.09}
tensor([[-1.9609, -1.2969, 1.1953, 0.6250, -1.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([[-3.1719, -2.5156, 0.4434, 0.1152, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([2], device='cuda:2')
tensor([[0.7305, 1.2734, 3.3281, 2.4688, 0.7773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:33,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.62 | bwd_microstep: 0.63 | bwd_inner_microstep: 0.53 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.7188, -2.1562, 0.5742, 1.0156, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.2656, -2.5625, 0.5312, 0.0801, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.3438, -1.7109, 0.8594, 0.3828, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.7656, -1.3125, 0.6992, 1.0078, -1.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.5156, -2.7656, 0.4277, -0.3320, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:54:34,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.22
[2025-11-06 17:54:34,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.76 | bwd_microstep: 1096.61 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 1095.75 | step_microstep: 3.03
[2025-11-06 17:54:34,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.41 | bwd: 1097.24 | bwd_inner: 1.28 | bwd_allreduce: 1095.80 | step: 3.11
10%|▉ | 334/3507 [09:48<1:09:16, 1.31s/it] {'loss': 0.6023, 'learning_rate': 1.9779036635108892e-05, 'epoch': 0.1}
tensor([[-2.2031, -1.8125, 0.0082, 0.9336, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-2.1406, -1.4297, 1.3281, 0.5977, -1.8516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.7031, -2.2656, -0.1289, 0.8594, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:54:35,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.06 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.1719, -1.4922, 1.1797, 0.1660, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.0938, -1.3906, 1.6250, 0.5469, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6719, -2.9531, 0.2158, -0.1621, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.1562, -2.5625, 0.2021, 0.6797, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.8594, -2.2344, 0.7070, 0.6211, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:54:35,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 17:54:35,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.82 | bwd_microstep: 35.19 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 34.38 | step_microstep: 1.78
[2025-11-06 17:54:35,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.89 | bwd: 35.89 | bwd_inner: 1.35 | bwd_allreduce: 34.42 | step: 1.86
10%|▉ | 335/3507 [09:49<58:17, 1.10s/it] {'loss': 0.9197, 'learning_rate': 1.97771013580497e-05, 'epoch': 0.1}
tensor([[-2.3281, -1.9219, 0.1182, 1.2266, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.4219, -1.9609, 0.2734, 1.1172, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[0.8438, 1.3672, 3.4219, 2.9219, 0.8789]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:35,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.62 | bwd_microstep: 0.61 | bwd_inner_microstep: 0.52 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.0625, -2.4688, 0.3750, 0.3340, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.9844, -2.3438, 0.4609, 0.1523, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.3125, -1.6328, 1.1875, 0.3652, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.2500, -1.5000, 1.4453, 0.4453, -1.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.9141, -1.4375, 0.8477, 1.3906, -1.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:54:40,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.86 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 17:54:40,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.49 | bwd_microstep: 4585.17 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 4584.30 | step_microstep: 2.94
[2025-11-06 17:54:40,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 432.14 | bwd: 4585.79 | bwd_inner: 1.32 | bwd_allreduce: 4584.34 | step: 3.01
10%|▉ | 336/3507 [09:54<2:00:57, 2.29s/it] {'loss': 0.6292, 'learning_rate': 1.9775157738483733e-05, 'epoch': 0.1}
tensor([[-2.4844, -1.9062, 0.6094, 0.5273, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:54:40,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.61 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.5469, -1.9219, 0.7812, 0.7852, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6562, -3.2188, -1.0312, 0.1807, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.4062, -0.8906, 1.1875, 0.8867, -1.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.7031, -1.0391, 1.8047, 1.0312, -1.4609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.2344, -1.6562, 0.9648, 1.0469, -1.9609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.6094, -1.0703, 1.0000, 0.6172, -1.3828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.8750, -2.3281, 0.3398, 0.9883, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:54:41,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:54:41,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.51 | bwd_microstep: 170.41 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 169.62 | step_microstep: 1.67
[2025-11-06 17:54:41,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.14 | bwd: 171.16 | bwd_inner: 1.39 | bwd_allreduce: 169.65 | step: 1.75
10%|▉ | 337/3507 [09:54<1:34:09, 1.78s/it] {'loss': 0.6221, 'learning_rate': 1.9773205778069418e-05, 'epoch': 0.1}
tensor([[-0.9922, -0.4316, 1.9688, 1.7656, -0.8203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:54:41,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.18 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.07
tensor([[-1.8672, -1.1562, 1.5859, 0.9531, -1.6016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.1406, -2.3438, 0.7930, -0.2305, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.4531, -1.7578, 1.2109, 0.4844, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.9062, -2.4688, -0.3008, 0.5312, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.9531, -2.2969, 0.6914, 0.5938, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.1094, -1.6641, 0.4512, 1.3828, -1.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8281, -2.2969, 0.1138, 0.4512, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:54:42,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.19 | optimizer_step: 0.20
[2025-11-06 17:54:42,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.67 | bwd_microstep: 519.73 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 518.86 | step_microstep: 1.87
[2025-11-06 17:54:42,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 416.88 | bwd: 520.41 | bwd_inner: 1.37 | bwd_allreduce: 518.91 | step: 1.95
10%|▉ | 338/3507 [09:55<1:21:26, 1.54s/it] {'loss': 0.5557, 'learning_rate': 1.9771245478472308e-05, 'epoch': 0.1}
tensor([[-3.6562, -2.8750, 0.6016, 0.0972, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:42,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.31 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-2.7969, -2.3438, -0.1299, 0.7539, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.7344, -2.1250, 0.5664, 0.4453, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.8594, -2.1250, 0.9531, 0.1582, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.1875, -3.5312, -0.3867, 0.0854, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.4844, -1.7891, 1.2500, 0.5469, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.6406, -2.1562, 0.1719, 0.8945, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.1094, -1.5000, 1.1953, 1.1797, -1.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:54:42,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.68 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:54:42,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.76 | bwd_microstep: 203.33 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 202.22 | step_microstep: 2.15
[2025-11-06 17:54:42,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 286.08 | bwd: 204.31 | bwd_inner: 1.87 | bwd_allreduce: 202.27 | step: 2.25
10%|▉ | 339/3507 [09:56<1:05:19, 1.24s/it] {'loss': 0.5537, 'learning_rate': 1.976927684136507e-05, 'epoch': 0.1}
tensor([[-2.5000, -1.9375, 0.6172, 0.7148, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:42,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.58 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11
tensor([[-3.8125, -3.1250, 0.1514, 0.3027, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.4062, -1.8359, 0.7266, 0.8438, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.2812, -2.4688, 0.8242, -0.3164, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-2.7344, -2.2188, 0.2178, 0.9141, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.0469, -2.3594, 0.5508, 0.2949, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.3066, 0.4043, 3.0469, 1.4062, -0.1436]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-2.4219, -1.6719, 1.3672, 0.2158, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:54:43,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.16 | optimizer_step: 0.19
[2025-11-06 17:54:43,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.17 | bwd_microstep: 122.40 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 121.32 | step_microstep: 2.20
[2025-11-06 17:54:43,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.77 | bwd: 123.50 | bwd_inner: 2.01 | bwd_allreduce: 121.36 | step: 2.31
10%|▉ | 340/3507 [09:56<53:33, 1.01s/it] {'loss': 1.3289, 'learning_rate': 1.9767299868427475e-05, 'epoch': 0.1}
tensor([[-1.9531, -1.3438, 1.3516, 1.4062, -1.6953]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.3438, -1.6328, 1.4062, 0.8828, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.0781, -2.4219, 0.5000, 0.3105, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.4062, -1.7422, 1.0938, 0.8320, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:44,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.17 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.7969, -2.2344, 0.4434, 0.7617, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0938, -2.5312, 0.1699, 0.7852, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.1875, -3.5000, -0.1553, -0.0287, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.2031, -1.4922, 1.5000, 0.5430, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 17:54:45,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.75 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 17:54:45,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.79 | bwd_microstep: 584.50 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 583.17 | step_microstep: 2.51
[2025-11-06 17:54:45,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.97 | bwd: 585.46 | bwd_inner: 2.13 | bwd_allreduce: 583.20 | step: 2.59
10%|▉ | 341/3507 [09:58<1:08:12, 1.29s/it] {'loss': 0.5996, 'learning_rate': 1.9765314561346424e-05, 'epoch': 0.1}
tensor([[-2.8125, -2.2969, 0.1738, 0.7578, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0781, -2.3906, 0.5703, 0.5117, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:45,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.79 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.1094, -2.6250, -0.1885, 0.6250, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.9375, -2.3906, 0.1816, 0.6328, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.9062, -2.2812, 0.6953, 0.6836, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.3125, -2.6250, 0.4629, 0.0928, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.3125, -2.7969, -0.2158, 0.6289, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.7188, -1.9375, 1.3516, 0.3770, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:54:46,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.42 | optimizer_step: 0.44
[2025-11-06 17:54:46,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.37 | bwd_microstep: 673.83 | bwd_inner_microstep: 2.49 | bwd_allreduce_microstep: 671.11 | step_microstep: 4.31
[2025-11-06 17:54:46,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 293.23 | bwd: 675.88 | bwd_inner: 4.42 | bwd_allreduce: 671.18 | step: 4.40
10%|▉ | 342/3507 [09:59<1:03:45, 1.21s/it] {'loss': 0.543, 'learning_rate': 1.9763320921815913e-05, 'epoch': 0.1}
tensor([[-2.4844, -1.8516, 0.7891, 0.7422, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.7188, -2.2500, -0.0219, 0.9062, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.3906, -2.8906, -0.4473, 0.4648, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6250, -1.9531, 1.0703, 0.7031, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.3438, -1.8359, 0.6289, 1.2266, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6875, -2.2031, 0.0986, 0.9023, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:54:47,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.18 | bwd_microstep: 1.48 | bwd_inner_microstep: 1.36 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-2.7812, -2.3750, -0.3047, 0.8867, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.5156, -1.8203, 0.9844, 0.3828, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:48,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.87 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:54:48,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.51 | bwd_microstep: 2.26 | bwd_inner_microstep: 1.35 | bwd_allreduce_microstep: 0.83 | step_microstep: 3.78
[2025-11-06 17:54:48,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.73 | bwd: 3.74 | bwd_inner: 2.72 | bwd_allreduce: 0.87 | step: 3.87
10%|▉ | 343/3507 [10:01<1:17:18, 1.47s/it] {'loss': 0.5, 'learning_rate': 1.9761318951537053e-05, 'epoch': 0.1}
tensor([[-2.9531, -2.2500, 0.7539, 0.4141, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6719, -1.9766, 1.1641, 0.7539, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.2812, -5.5000, -1.9219, -1.7031, -5.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:54:48,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.85 | bwd_microstep: 1.13 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.8438, -3.2188, -0.3262, 0.0923, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0625, -2.3125, 0.9023, -0.0391, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.1406, -2.5312, 0.3926, 1.0625, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8594, -3.1875, 0.0649, 0.1318, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7500, -3.2188, -0.6406, 0.3008, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:54:51,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.13 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 17:54:51,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.64 | bwd_microstep: 2453.14 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 2452.03 | step_microstep: 3.34
[2025-11-06 17:54:51,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.50 | bwd: 2454.26 | bwd_inner: 2.06 | bwd_allreduce: 2452.07 | step: 3.42
10%|▉ | 344/3507 [10:04<1:39:00, 1.88s/it] {'loss': 0.541, 'learning_rate': 1.9759308652218074e-05, 'epoch': 0.1}
tensor([[-2.2031, -1.4375, 1.3906, 0.1699, -1.8984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.9688, -2.2656, 0.6328, -0.0659, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.9688, -1.3906, 1.0781, 1.0859, -1.7109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.2344, -1.7422, 0.5000, 1.0078, -1.9609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4219, -2.9062, -0.5195, -0.0688, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.9062, -2.5000, -0.4570, 0.7109, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.3750, -1.9531, 0.1074, 0.9141, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:54:51,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.23 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.3750, -2.7500, 0.1816, 0.2451, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:51,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.32 | optimizer_step: 0.36
[2025-11-06 17:54:51,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.77 | bwd_microstep: 4.49 | bwd_inner_microstep: 2.51 | bwd_allreduce_microstep: 1.80 | step_microstep: 2.92
[2025-11-06 17:54:51,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 396.02 | bwd: 5.58 | bwd_inner: 3.54 | bwd_allreduce: 1.84 | step: 3.00
10%|▉ | 345/3507 [10:05<1:20:14, 1.52s/it] {'loss': 0.543, 'learning_rate': 1.9757290025574297e-05, 'epoch': 0.1}
tensor([[-3.1250, -2.4375, 0.6367, 0.3262, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5938, -4.0938, -1.5469, -0.1533, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 17:54:51,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.20 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.4688, -2.9062, -0.2041, 0.4844, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.7031, -2.2188, 0.1118, 1.0703, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.4531, -1.6719, 1.4531, 0.1416, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.1094, -1.4375, 1.3438, 1.1406, -1.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7812, -2.3125, -0.0090, 1.0625, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.6406, -2.1094, 0.4434, 1.1719, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:54:53,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.49 | optimizer_gradients: 0.21 | optimizer_step: 0.20
[2025-11-06 17:54:53,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.59 | bwd_microstep: 1796.46 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 1795.20 | step_microstep: 3.93
[2025-11-06 17:54:53,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.80 | bwd: 1797.38 | bwd_inner: 2.00 | bwd_allreduce: 1795.25 | step: 4.01
10%|▉ | 346/3507 [10:07<1:29:56, 1.71s/it] {'loss': 0.9287, 'learning_rate': 1.975526307332816e-05, 'epoch': 0.1}
tensor([[-2.0469, -1.5703, 0.6992, 1.5703, -1.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.0781, -1.6797, 0.2754, 1.1875, -1.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -3.9219, -0.5039, -0.0601, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6250e+00, -2.1875e+00, -8.6975e-04, 9.7266e-01, -2.3281e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.9609, -1.2812, 1.3125, 0.1768, -1.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.0000, -2.5000, -0.0593, 0.8750, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2812, -2.6250, 0.1553, 0.2295, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:54,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.11 | bwd_microstep: 1.44 | bwd_inner_microstep: 1.29 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-1.7500, -1.0781, 1.9375, 1.4531, -1.5078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:55,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.11 | optimizer_gradients: 0.18 | optimizer_step: 0.20
[2025-11-06 17:54:55,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.93 | bwd_microstep: 2.82 | bwd_inner_microstep: 1.84 | bwd_allreduce_microstep: 0.89 | step_microstep: 3.35
[2025-11-06 17:54:55,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 483.06 | bwd: 4.26 | bwd_inner: 3.15 | bwd_allreduce: 0.94 | step: 3.46
10%|▉ | 347/3507 [10:09<1:26:29, 1.64s/it] {'loss': 0.4875, 'learning_rate': 1.97532277972092e-05, 'epoch': 0.1}
tensor([[-2.3281, -1.5938, 1.3594, 0.4004, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:55,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 70.54 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-2.4688, -1.7891, 0.8945, 0.1943, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.1719, -2.6406, -0.0933, 0.3438, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2031, -2.5781, 0.2383, 0.5742, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6094, -3.0469, -0.2158, 0.4219, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.3125, -2.7188, 0.1172, 0.6328, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.4844, -1.8672, 0.8867, 0.8516, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5469, -2.8906, 0.1045, 0.2812, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:54:57,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.38 | optimizer_gradients: 0.18 | optimizer_step: 0.24
[2025-11-06 17:54:57,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 229.04 | bwd_microstep: 1754.12 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 1753.22 | step_microstep: 3.77
[2025-11-06 17:54:57,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.61 | bwd: 1754.95 | bwd_inner: 1.53 | bwd_allreduce: 1753.27 | step: 3.85
10%|▉ | 348/3507 [10:11<1:33:37, 1.78s/it] {'loss': 0.9487, 'learning_rate': 1.975118419895405e-05, 'epoch': 0.1}
tensor([[-2.4375, -1.9062, 0.6484, 1.4375, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0312, -2.4844, 0.1982, 0.6328, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7188, -3.1875, -0.4902, 0.5352, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4844, -2.8125, 0.1904, 0.2100, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1250, -3.2344, 0.5430, -0.6445, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.4062, -1.8359, 0.8242, 1.0703, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.8125, -1.2812, 1.2500, 1.5469, -1.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:54:58,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.34 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.13
tensor([[-2.7344, -2.2500, 0.1455, 1.0078, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:54:58,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.50 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 17:54:58,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.24 | bwd_microstep: 2.11 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.89 | step_microstep: 3.96
[2025-11-06 17:54:58,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 396.59 | bwd: 3.22 | bwd_inner: 2.12 | bwd_allreduce: 0.95 | step: 4.10
10%|▉ | 349/3507 [10:12<1:21:45, 1.55s/it] {'loss': 0.5496, 'learning_rate': 1.9749132280306456e-05, 'epoch': 0.1}
tensor([[-2.8594, -2.0781, 1.2031, 0.4941, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.6250, -1.8438, 1.1250, -0.0703, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 17:54:58,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.08 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-1.7812, -1.2500, 1.2344, 1.1406, -1.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.5469, -0.9844, 1.5469, 1.7812, -1.3359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.3438, -1.7188, 1.0234, 1.2188, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.1719, -1.5078, 1.1875, 0.4766, -1.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.4375, -2.7656, 0.4668, 0.6602, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0625, -4.1875, -0.2773, -0.9375, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:54:59,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.16 | optimizer_step: 0.21
[2025-11-06 17:54:59,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.53 | bwd_microstep: 809.66 | bwd_inner_microstep: 2.22 | bwd_allreduce_microstep: 807.26 | step_microstep: 1.90
[2025-11-06 17:54:59,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 420.67 | bwd: 810.65 | bwd_inner: 3.14 | bwd_allreduce: 807.31 | step: 1.98
10%|▉ | 350/3507 [10:13<1:17:18, 1.47s/it] {'loss': 0.5967, 'learning_rate': 1.974707204301726e-05, 'epoch': 0.1}
tensor([[-1.8672, -1.4609, 0.5156, 1.4375, -1.6328]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7812, -4.0312, -0.6250, -0.7031, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6875, -2.8906, 0.4551, 0.0554, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:54:59,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.03 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.0000, -2.2500, 1.0547, 0.2832, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5625, -2.8438, 0.4590, 0.1030, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.6484, -1.0781, 1.5547, 1.3906, -1.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.4062, -5.5312, -1.6406, -1.8906, -5.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.7344, -1.1719, 1.3281, 1.6875, -1.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:55:00,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.37 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 17:55:00,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.08 | bwd_microstep: 26.83 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 25.96 | step_microstep: 3.65
[2025-11-06 17:55:00,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.13 | bwd: 27.65 | bwd_inner: 1.53 | bwd_allreduce: 25.99 | step: 3.73
10%|█ | 351/3507 [10:13<1:00:15, 1.15s/it] {'loss': 0.7856, 'learning_rate': 1.97450034888444e-05, 'epoch': 0.1}
tensor([[-1.8750, -1.2500, 1.3047, 0.7617, -1.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:55:00,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 105.06 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-2.8906, -2.3281, 0.3652, 1.0000, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2656, -2.7188, -0.0571, 0.6602, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.3750, -2.6250, 0.8164, 0.4316, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.3438, -0.6445, 2.0156, 1.0859, -1.1172]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.8438, -2.1406, 0.9688, 0.7812, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.7891, -1.1172, 1.6953, 1.1406, -1.5391]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.8281, -2.1875, 0.6562, 0.9219, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:55:01,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.84 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 17:55:01,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.14 | bwd_microstep: 1380.04 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1378.94 | step_microstep: 2.78
[2025-11-06 17:55:01,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.21 | bwd: 1381.05 | bwd_inner: 1.94 | bwd_allreduce: 1378.98 | step: 2.87
10%|█ | 352/3507 [10:15<1:10:34, 1.34s/it] {'loss': 0.6089, 'learning_rate': 1.974292661955291e-05, 'epoch': 0.1}
tensor([[-1.7578, -1.2578, 1.1406, 1.6797, -1.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.5156, -1.9531, 0.6289, 1.1094, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9375, -3.0938, 0.5430, -0.2197, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.9219, -2.4531, -0.0757, 0.9102, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.4062, -1.9531, 0.2637, 1.1719, -2.1250]],
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:02,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.69 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5781, -3.0312, -0.3711, 0.4922, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0469, -2.4844, 0.1289, 0.6914, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1250, -2.6875, -0.5391, 0.4941, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:02,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.13 | optimizer_step: 0.17 [2025-11-06 17:55:02,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.86 | bwd_microstep: 1.93 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.03 [2025-11-06 17:55:02,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.47 | bwd: 2.83 | bwd_inner: 1.78 | bwd_allreduce: 0.92 | step: 2.10 10%|█ | 353/3507 [10:16<58:33, 1.11s/it] {'loss': 0.5789, 'learning_rate': 1.9740841436914917e-05, 'epoch': 0.1} 10%|█ | 353/3507 [10:16<58:33, 1.11s/it]tensor([[-3.5938, -2.7969, 0.5820, 0.0325, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5469, -2.0312, 0.4883, 1.1406, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4219, -2.8438, -0.2197, 0.1826, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5625, -1.9062, 1.0234, 0.6250, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:02,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.76 | 
bwd_microstep: 0.93 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.8125, -5.0938, -1.6094, -0.8789, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.3594, -1.6719, 1.3438, 0.5859, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.2812, -1.6406, 1.1094, 0.8164, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.4531, -1.9453, 0.3750, 0.8789, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:55:04,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 17:55:04,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.51 | bwd_microstep: 1734.43 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1733.37 | step_microstep: 3.26 [2025-11-06 17:55:04,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.21 | bwd: 1735.36 | bwd_inner: 1.82 | bwd_allreduce: 1733.41 | step: 3.34 10%|█ | 354/3507 [10:18<1:14:46, 1.42s/it] {'loss': 0.6675, 'learning_rate': 1.9738747942709652e-05, 'epoch': 0.1} 10%|█ | 354/3507 [10:18<1:14:46, 1.42s/it]tensor([[-2.3594, -1.9297, 0.1079, 1.0781, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8125, -2.2969, 0.1309, 0.7461, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-3.6250, -2.9219, 0.3711, 0.1572, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:04,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.29 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.5781, -2.9531, 0.0161, 0.2910, -3.1875]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0781, -2.2188, 1.1641, 0.0359, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8438, -2.1875, 0.6875, 0.6914, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2188, -4.5625, -1.3281, -0.5938, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8281, -3.1094, 0.3086, 0.5898, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:55:05,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.79 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 17:55:05,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.44 | bwd_microstep: 118.43 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 117.20 | step_microstep: 2.55 [2025-11-06 17:55:05,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.76 | bwd: 119.29 | bwd_inner: 1.92 | bwd_allreduce: 117.24 | step: 2.64 10%|█ | 355/3507 [10:18<1:00:16, 1.15s/it] {'loss': 0.9719, 'learning_rate': 1.9736646138723423e-05, 'epoch': 0.1} 10%|█ | 355/3507 [10:18<1:00:16, 1.15s/it]tensor([[-4.0625, -3.3281, 0.1729, 0.1289, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6875, -2.8438, 0.8867, -0.0732, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5625, -2.8125, 0.6641, 0.2930, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9219, -3.1719, 0.2832, 0.3164, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:05,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 84.92 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.04 | 
step_microstep: 0.09 tensor([[-2.6250, -1.9688, 0.7852, 0.4023, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.2656, -1.7656, 0.7109, 1.3750, -2.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.3320, 0.8125, 2.8438, 2.9219, 0.3828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4844, -2.8906, -0.0383, 0.5508, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:55:08,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.21 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 17:55:08,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.04 | bwd_microstep: 3140.36 | bwd_inner_microstep: 2.85 | bwd_allreduce_microstep: 3137.32 | step_microstep: 3.58 [2025-11-06 17:55:08,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.01 | bwd: 3141.40 | bwd_inner: 3.83 | bwd_allreduce: 3137.36 | step: 3.67 10%|█ | 356/3507 [10:22<1:40:16, 1.91s/it] {'loss': 0.6067, 'learning_rate': 1.9734536026749643e-05, 'epoch': 0.1} 10%|█ | 356/3507 [10:22<1:40:16, 1.91s/it]tensor([[-1.8047, -1.0859, 1.7969, 1.0469, -1.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.1562, -1.6406, 0.8203, 1.4609, -1.8984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:09,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.64 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.07 tensor([[-3.0625, -2.2656, 1.2344, 0.2520, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -4.4375, -0.9609, -0.4922, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, 
-3.6094, -0.3281, -0.2676, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.7500, -2.1719, 0.5859, 1.1875, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.4375, -1.7031, 1.2969, 0.5000, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2969, -2.7969, -0.3027, 0.6836, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:55:09,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.35 | optimizer_step: 0.30 [2025-11-06 17:55:09,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.48 | bwd_microstep: 30.89 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 29.70 | step_microstep: 3.13 [2025-11-06 17:55:09,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.14 | bwd: 31.77 | bwd_inner: 1.83 | bwd_allreduce: 29.76 | step: 3.19 10%|█ | 357/3507 [10:23<1:17:21, 1.47s/it] {'loss': 0.4873, 'learning_rate': 1.9732417608588803e-05, 'epoch': 0.1} 10%|█ | 357/3507 [10:23<1:17:21, 1.47s/it]tensor([[-3.3438, -2.6406, 0.7500, 0.8516, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:09,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.32 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10 tensor([[-4.6250, -3.8594, -0.2500, -0.1641, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2812, -2.6719, 0.1406, 0.4863, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1875, -2.6562, -0.0488, 0.8633, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6250, -4.7812, -0.8477, -0.7578, -5.0312]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5625, -1.9531, 0.8516, 1.2656, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.0469, -2.6094, -0.4258, 0.5664, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.4531, -0.7539, 2.0312, 1.0547, -1.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:55:11,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.18 | optimizer_step: 0.23 [2025-11-06 17:55:11,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.82 | bwd_microstep: 1553.99 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1552.78 | step_microstep: 2.53 [2025-11-06 17:55:11,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.14 | bwd: 1555.06 | bwd_inner: 2.10 | bwd_allreduce: 1552.82 | step: 2.63 10%|█ | 358/3507 [10:25<1:24:30, 1.61s/it] {'loss': 0.5896, 'learning_rate': 1.9730290886048487e-05, 'epoch': 0.1} 10%|█ | 358/3507 [10:25<1:24:30, 1.61s/it]tensor([[-2.1562, -1.3828, 1.6484, 0.4434, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.6719, -2.1250, 0.4492, 1.0469, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:11,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.89 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.4375, -1.7656, 0.9648, 0.6094, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.2500, -0.7070, 1.7344, 1.9375, -1.0547]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.9375, -0.1729, 2.5469, 1.1328, -0.7227]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:1') tensor([[-1.4766, -0.9102, 1.4609, 1.5859, -1.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.0000, -2.3125, 0.7930, 0.5625, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.9531, -2.2812, 0.6250, 0.6055, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:55:11,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.13 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 17:55:11,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.15 | bwd_microstep: 25.25 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 24.05 | step_microstep: 2.77 [2025-11-06 17:55:11,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.07 | bwd: 26.27 | bwd_inner: 2.04 | bwd_allreduce: 24.09 | step: 2.86 10%|█ | 359/3507 [10:25<1:05:35, 1.25s/it] {'loss': 0.6013, 'learning_rate': 1.972815586094336e-05, 'epoch': 0.1} 10%|█ | 359/3507 [10:25<1:05:35, 1.25s/it]tensor([[-3.4375, -2.9375, -0.4570, 0.7227, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-2.0156, -1.3750, 1.3281, 1.1406, -1.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7500, -2.1875, 0.5039, 1.1250, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7188, -1.8750, 1.4609, 0.2383, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:11,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.72 | bwd_microstep: 2.19 | bwd_inner_microstep: 1.93 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.19 tensor([[-2.6562, -1.8203, 1.6172, 0.4238, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') 
tensor([[-3.5312, -2.7500, 0.7305, 0.2676, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2500, -3.6406, -0.6562, -0.0713, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.3281, -2.6875, 0.3086, 0.4590, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:55:12,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.74 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 17:55:12,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.12 | bwd_microstep: 390.75 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 389.65 | step_microstep: 2.52 [2025-11-06 17:55:12,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.89 | bwd: 392.94 | bwd_inner: 2.98 | bwd_allreduce: 389.74 | step: 2.72 10%|█ | 360/3507 [10:26<1:01:09, 1.17s/it] {'loss': 1.0239, 'learning_rate': 1.9726012535095182e-05, 'epoch': 0.1} 10%|█ | 360/3507 [10:26<1:01:09, 1.17s/it]tensor([[-1.4688, -0.8477, 1.7422, 1.1328, -1.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.7969, 1.3125, 3.3281, 2.6094, 0.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5625, -3.1094, -0.8672, 0.4277, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.8047, -1.2188, 1.3594, 1.2031, -1.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.7656, -1.3281, 0.4844, 0.6250, -1.5234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:12,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 343.31 | bwd_microstep: 2.49 | bwd_inner_microstep: 2.15 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.21 tensor([[-2.5938, -2.0938, 0.2441, 0.7109, 
-2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9062, -5.1250, -1.3750, -1.0156, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.4688, -1.7891, 1.2188, 1.1484, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:55:13,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.13 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 17:55:13,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 81.68 | bwd_microstep: 332.22 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 331.09 | step_microstep: 2.83 [2025-11-06 17:55:13,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 425.01 | bwd: 334.71 | bwd_inner: 3.24 | bwd_allreduce: 331.18 | step: 3.04 10%|█ | 361/3507 [10:27<55:28, 1.06s/it] {'loss': 0.5945, 'learning_rate': 1.9723860910332783e-05, 'epoch': 0.1} 10%|█ | 361/3507 [10:27<55:28, 1.06s/it]tensor([[-2.8750, -2.1250, 1.0391, 0.1924, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9688, -3.2969, -0.2715, -0.3887, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.5820, -0.1182, 1.9844, 2.5625, -0.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1875, -2.6250, 0.1436, 0.7812, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9375, -2.3750, 0.2471, 0.6719, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4375, -4.7188, -1.0469, -0.3027, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:14,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.15 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | 
bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.1562, -1.5469, 1.1484, 0.7305, -1.8672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2812, -1.7812, 0.6602, 1.5000, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:14,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 17:55:14,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.56 | bwd_microstep: 2.17 | bwd_inner_microstep: 1.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 2.09 [2025-11-06 17:55:14,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 319.67 | bwd: 3.14 | bwd_inner: 2.22 | bwd_allreduce: 0.80 | step: 2.18 10%|█ | 362/3507 [10:28<1:01:04, 1.17s/it] {'loss': 0.5154, 'learning_rate': 1.972170098849208e-05, 'epoch': 0.1} 10%|█ | 362/3507 [10:28<1:01:04, 1.17s/it]tensor([[-3.9844, -3.2812, -0.1826, -0.0581, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7969, -2.1250, 0.9922, 0.7305, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:15,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.86 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.6875, -1.8984, 1.1641, 0.2754, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0312, -3.3594, -0.0134, 0.2275, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.4844, -0.7656, 1.7812, 0.5156, -1.2266]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.3438, -1.9219, 0.1357, 0.8711, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') 
tensor([[-3.5312, -3.0000, -0.1992, 0.7383, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3906, -2.7344, 0.4629, 1.0078, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 17:55:15,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 17:55:15,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.20 | bwd_microstep: 225.72 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 224.66 | step_microstep: 1.87 [2025-11-06 17:55:15,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 304.08 | bwd: 226.66 | bwd_inner: 1.80 | bwd_allreduce: 224.69 | step: 1.95 10%|█ | 363/3507 [10:29<51:36, 1.02it/s] {'loss': 0.9045, 'learning_rate': 1.971953277141607e-05, 'epoch': 0.1} 10%|█ | 363/3507 [10:29<51:36, 1.02it/s]tensor([[-3.3750, -2.6562, 0.4668, 0.0884, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.8359, -1.2422, 1.4062, 1.6328, -1.5703]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.6250, -1.7891, 1.4219, 0.2773, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.1094, -2.4688, 0.4590, 0.4297, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:55:15,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.71 | bwd_microstep: 1.16 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.9688, -2.3438, 0.6914, 0.8281, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2969, -2.4688, 0.9766, 0.0221, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0938, -3.5312, -0.6016, 0.2197, -3.6562]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1719, -2.4688, 0.7383, 0.4160, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:55:17,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 17:55:17,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.21 | bwd_microstep: 1442.78 | bwd_inner_microstep: 1.49 | bwd_allreduce_microstep: 1441.20 | step_microstep: 1.64 [2025-11-06 17:55:17,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.95 | bwd: 1443.94 | bwd_inner: 2.56 | bwd_allreduce: 1441.24 | step: 1.73 10%|█ | 364/3507 [10:31<1:05:09, 1.24s/it] {'loss': 0.6208, 'learning_rate': 1.971735626095483e-05, 'epoch': 0.1} 10%|█ | 364/3507 [10:31<1:05:09, 1.24s/it]tensor([[-3.2812, -2.7344, -0.0140, 0.9609, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0625, -3.4062, -0.1719, 0.3516, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:17,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.08 | bwd_microstep: 1.80 | bwd_inner_microstep: 1.63 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-2.6875, -2.0625, 0.8125, 0.9336, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1094, -2.6406, -0.2520, 0.8828, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-4.0625, -3.3750, -0.1504, 0.0718, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2969, -1.7344, 0.8594, 1.3594, -1.9922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6875, -3.1875, -0.6016, 0.6016, -3.2500]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3594, -2.5625, 1.1094, 0.5352, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:55:18,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.21 [2025-11-06 17:55:18,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.32 | bwd_microstep: 417.58 | bwd_inner_microstep: 1.42 | bwd_allreduce_microstep: 416.03 | step_microstep: 2.22 [2025-11-06 17:55:18,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.42 | bwd: 419.38 | bwd_inner: 3.08 | bwd_allreduce: 416.11 | step: 2.33 10%|█ | 365/3507 [10:31<59:09, 1.13s/it] {'loss': 1.0198, 'learning_rate': 1.9715171458965505e-05, 'epoch': 0.1} 10%|█ | 365/3507 [10:31<59:09, 1.13s/it]tensor([[-1.1016, -0.5430, 1.9219, 2.0938, -0.9023]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.0469, -2.3125, 0.8672, 0.7148, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8281, -2.3281, 0.2354, 1.2188, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3281, -1.5234, 1.6172, 0.2119, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:18,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.67 | bwd_microstep: 1.09 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-2.4375, -1.8281, 0.8711, 0.4219, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.8281, -2.0000, 1.2891, -0.0767, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6406, -2.8750, 0.3945, 0.2578, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') 
tensor([[-2.4375, -1.8828, 0.7266, 1.2031, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 17:55:19,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.15 | optimizer_gradients: 0.16 | optimizer_step: 0.22
[2025-11-06 17:55:19,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.74 | bwd_microstep: 1175.89 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1174.71 | step_microstep: 3.10
[2025-11-06 17:55:19,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.41 | bwd: 1176.98 | bwd_inner: 2.11 | bwd_allreduce: 1174.75 | step: 3.17
10%|█ | 366/3507 [10:33<1:10:07, 1.34s/it] {'loss': 0.5134, 'learning_rate': 1.9712978367312326e-05, 'epoch': 0.1}
tensor([[-3.2031, -2.6406, 0.1113, 0.7500, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.0938, -2.6094, -0.0579, 1.2812, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.2812, -1.7812, 0.6445, 1.3359, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.5938, -1.8594, 1.4297, 1.0234, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.5312, -2.7344, 0.7422, -0.0757, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.0625, -2.3125, 1.0781, 0.3555, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.1875, -2.5312, 0.5703, 0.8359, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:55:20,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.85 | bwd_microstep: 1.12 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.5000, -3.8125, -0.6094, -0.3340, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:20,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.17 | optimizer_step: 0.25
[2025-11-06 17:55:20,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.23 | bwd_microstep: 2.45 | bwd_inner_microstep: 1.32 | bwd_allreduce_microstep: 1.03 | step_microstep: 1.93
[2025-11-06 17:55:20,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 496.14 | bwd: 3.57 | bwd_inner: 2.32 | bwd_allreduce: 1.08 | step: 2.02
10%|█ | 367/3507 [10:34<1:01:09, 1.17s/it] {'loss': 0.7432, 'learning_rate': 1.9710776987866597e-05, 'epoch': 0.1}
tensor([[-3.6875, -3.1562, -0.4785, 0.6523, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:1')
tensor([[-1.1953, -0.4863, 2.5000, 1.7422, -0.9648]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.8750, -3.2031, 0.1152, 0.4941, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.5312, -1.6719, 1.6953, 0.1699, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:21,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 73.22 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.2344, -1.4766, 1.4219, 0.4863, -1.9141]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
tensor([[-2.1250, -1.5000, 1.2656, 1.0312, -1.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.6719, -2.2500, -0.2500, 0.7734, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-2.8594, -2.3438, 0.2148, 1.1250, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 17:55:21,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:55:21,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.94 | bwd_microstep: 170.27 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 169.05 | step_microstep: 1.73
[2025-11-06 17:55:21,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 235.15 | bwd: 171.23 | bwd_inner: 1.99 | bwd_allreduce: 169.09 | step: 1.82
10%|█ | 368/3507 [10:35<55:06, 1.05s/it] {'loss': 1.3093, 'learning_rate': 1.9708567322506676e-05, 'epoch': 0.1}
tensor([[-2.8438, -2.0000, 1.5000, 0.2324, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.7031, -2.2656, -0.0898, 1.1641, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.7188, -1.9141, 1.2656, -0.1099, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.1250, -3.4062, -0.0635, 0.0664, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.7500, -4.0000, -0.5781, -0.2773, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-1.8984, -1.2812, 1.3672, 1.2500, -1.6172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.9688, -2.3438, 0.4766, 0.6094, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:55:23,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.80 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.21
tensor([[-2.7344, -2.0000, 1.2891, 0.6680, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:23,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 17:55:23,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 314.23 | bwd_microstep: 2.13 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.91 | step_microstep: 2.43
[2025-11-06 17:55:23,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 506.05 | bwd: 3.09 | bwd_inner: 1.99 | bwd_allreduce: 0.95 | step: 2.65
11%|█ | 369/3507 [10:37<1:17:47, 1.49s/it] {'loss': 0.5208, 'learning_rate': 1.9706349373118012e-05, 'epoch': 0.11}
tensor([[-2.7500, -2.2188, 0.3691, 1.0000, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.5938, -2.0469, 0.6719, 1.0156, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:55:24,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.75 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.0000, -2.1875, 1.0234, 0.1001, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-1.6172, -0.8828, 2.0469, 1.5781, -1.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.1562, -4.4062, -0.9180, -0.8945, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.3125, -5.6562, -2.2031, -1.1172, -5.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.7188, -2.0938, 0.7227, 0.6094, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-2.6875, -2.1719, 0.4121, 1.2969, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 17:55:24,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 17:55:24,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.23 | bwd_microstep: 71.22 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 70.09 | step_microstep: 2.10
[2025-11-06 17:55:24,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.00 | bwd: 72.05 | bwd_inner: 1.77 | bwd_allreduce: 70.13 | step: 2.18
11%|█ | 370/3507 [10:38<1:01:21, 1.17s/it] {'loss': 1.0728, 'learning_rate': 1.9704123141593114e-05, 'epoch': 0.11}
tensor([[-2.3906, -2.0000, -0.0045, 1.2344, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.5625, -3.0156, -0.3438, 0.5156, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.2656, -2.5156, 0.8594, 0.1699, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-1.8438, -0.9727, 2.1250, 0.4316, -1.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.0938, -1.5938, 0.8047, 1.5156, -1.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.7344, -2.8594, 0.8711, -0.1221, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.4531, -1.9844, 0.3398, 1.4219, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:55:27,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.06 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.8438, -2.3594, 0.1162, 1.4688, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:55:27,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 17:55:27,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.93 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 0.81 | step_microstep: 2.31
[2025-11-06 17:55:27,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 413.00 | bwd: 3.01 | bwd_inner: 2.00 | bwd_allreduce: 0.86 | step: 2.41
11%|█ | 371/3507 [10:41<1:31:34, 1.75s/it] {'loss': 0.3594, 'learning_rate': 1.970188862983156e-05, 'epoch': 0.11}
tensor([[-4.0625, -3.4219, -0.3223, 0.2969, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.2188, -3.4844, -0.1367, -0.1299, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.9375, -4.2500, -0.9688, -0.4609, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:27,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.83 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-3.2031, -2.7500, -0.5391, 0.8047, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:1')
tensor([[-4.9375, -4.2812, -1.0781, -0.5664, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.4062, -2.7969, 0.0300, 0.3828, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.8281, -3.2969, -0.6367, 0.3281, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.9062, -2.0625, 1.4922, 0.2598, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 17:55:28,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:55:28,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.19 | bwd_microstep: 74.19 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 73.19 | step_microstep: 1.49
[2025-11-06 17:55:28,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.07 | bwd: 75.19 | bwd_inner: 1.77 | bwd_allreduce: 73.25 | step: 1.58
11%|█ | 372/3507 [10:41<1:11:31, 1.37s/it] {'loss': 0.9802, 'learning_rate': 1.9699645839739987e-05, 'epoch': 0.11}
tensor([[-3.1719, -2.3594, 1.1172, 0.3184, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.5781, -1.8750, 1.0156, 0.3770, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.6250, -2.9062, 0.4473, 0.5547, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.2812, -3.6562, -0.6523, 0.2812, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.9375, -3.1719, 0.2734, 0.0679, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.5938, -1.9062, 1.0078, 0.7852, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.7656, -2.1406, 0.7617, 0.7930, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:29,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.30 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.8438, -2.3750, -0.1021, 1.2188, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:55:29,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 17:55:29,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.03 | bwd_microstep: 1.95 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.11
[2025-11-06 17:55:29,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 267.34 | bwd: 2.91 | bwd_inner: 1.97 | bwd_allreduce: 0.82 | step: 2.20
11%|█ | 373/3507 [10:43<1:10:23, 1.35s/it] {'loss': 0.5378, 'learning_rate': 1.9697394773232104e-05, 'epoch': 0.11}
tensor([[-0.8438, -0.4297, 1.3438, 2.0625, -0.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.1562, -2.5000, 0.4961, 0.9609, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.7031, -1.9531, 1.1641, 0.9023, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:29,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.11 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.08
tensor([[-4.1250, -3.3594, -0.0120, -0.0659, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.7031, -1.9297, 1.3672, 0.7930, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-6.8438, -6.0312, -2.1094, -1.6484, -6.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.7812, -3.0938, 0.1113, 0.5312, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.3594, -1.5156, 1.6016, 0.1514, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 17:55:29,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 17:55:29,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.67 | bwd_microstep: 52.84 | bwd_inner_microstep: 1.38 | bwd_allreduce_microstep: 51.39 | step_microstep: 1.45
[2025-11-06 17:55:29,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.82 | bwd: 53.68 | bwd_inner: 2.14 | bwd_allreduce: 51.42 | step: 1.54
11%|█ | 374/3507 [10:43<56:30, 1.08s/it] {'loss': 0.7085, 'learning_rate': 1.9695135432228678e-05, 'epoch': 0.11}
tensor([[-2.4531, -1.9688, 0.4023, 1.6328, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-0.3574, 0.2715, 2.6875, 2.2969, -0.2051]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.6406, -1.9922, 1.0781, 1.1094, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.7031, -2.1250, 0.6094, 1.2188, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.4375, -1.4922, 1.7578, 0.1553, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-2.0469, -1.5000, 0.9805, 1.7188, -1.7422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.1719, -2.6719, -0.2012, 0.9766, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:55:32,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.45 | bwd_microstep: 1.24 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.7500, -4.0312, -0.5547, -0.2031, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:32,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 2.19 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 17:55:32,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.01 | bwd_microstep: 1.86 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.83 | step_microstep: 5.56
[2025-11-06 17:55:32,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.47 | bwd: 3.09 | bwd_inner: 2.09 | bwd_allreduce: 0.87 | step: 5.65
11%|█ | 375/3507 [10:46<1:24:34, 1.62s/it] {'loss': 1.0117, 'learning_rate': 1.9692867818657535e-05, 'epoch': 0.11}
tensor([[-1.9297, -1.2578, 1.6719, 1.8125, -1.6172]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[3.8906, 4.2500, 5.3750, 5.0312, 3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:55:32,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.06 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.6562, -2.0312, 0.6172, 0.6992, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.9531, -2.0625, 1.1953, -0.0684, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-1.7266, -1.1641, 1.2344, 1.5312, -1.4453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.0469, -2.1562, 1.2891, -0.1484, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.4688, -1.8047, 1.0078, 0.6484, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.1250, -4.3438, -0.5898, -0.4727, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 17:55:33,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.82 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:55:33,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.76 | bwd_microstep: 293.50 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 292.42 | step_microstep: 2.24
[2025-11-06 17:55:33,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 312.85 | bwd: 294.27 | bwd_inner: 1.66 | bwd_allreduce: 292.46 | step: 2.31
11%|█ | 376/3507 [10:47<1:09:15, 1.33s/it] {'loss': 0.6653, 'learning_rate': 1.9690591934453564e-05, 'epoch': 0.11}
tensor([[-3.8750, -3.2500, -0.3242, 0.3379, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.5312, -2.9844, -0.3926, 0.7188, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.2188, -3.4375, 0.1904, 0.2471, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-1.9688, -1.4922, 0.8398, 1.7578, -1.6797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.8125, -2.2812, 0.3418, 1.3047, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.1094, -2.3438, 1.0547, 0.5078, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:34,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.92 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-3.2500, -2.5781, 0.4902, 0.8203, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-1.8125, -1.1875, 1.4531, 1.2734, -1.5234]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 17:55:34,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:55:34,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.32 | bwd_microstep: 216.38 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 215.07 | step_microstep: 1.81
[2025-11-06 17:55:34,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.26 | bwd: 217.29 | bwd_inner: 2.02 | bwd_allreduce: 215.11 | step: 1.92
11%|█ | 377/3507 [10:48<1:08:01, 1.30s/it] {'loss': 0.5044, 'learning_rate': 1.9688307781558705e-05, 'epoch': 0.11}
tensor([[-2.8438, -2.1562, 0.7148, 0.6992, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-1.1250, -0.6914, 1.1484, 1.9453, -0.8711]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 17:55:34,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.12 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.7656, -3.2188, -0.4844, 0.6484, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.9531, -2.1719, 1.1094, 0.4297, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>)
tensor([[-4.5938, -4.0000, -1.0156, 0.1572, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([2], device='cuda:0')
tensor([[-3.0156, -2.2812, 1.0703, 0.9492, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.9531, -2.4062, 0.2227, 1.0547, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.3281, -1.4922, 1.7656, 0.2559, -1.9609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 17:55:35,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.18 | optimizer_step: 0.21
[2025-11-06 17:55:35,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.67 | bwd_microstep: 314.83 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 313.77 | step_microstep: 2.76
[2025-11-06 17:55:35,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 267.81 | bwd: 315.75 | bwd_inner: 1.79 | bwd_allreduce: 313.81 | step: 2.84
11%|█ | 378/3507 [10:48<57:14, 1.10s/it] {'loss': 0.4583, 'learning_rate': 1.968601536192196e-05, 'epoch': 0.11}
tensor([[-1.0312, -0.4473, 2.0625, 2.2656, -0.8164]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:3')
tensor([[-0.6953, -0.1465, 2.0938, 2.2344, -0.5117]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.5312, -1.6484, 1.7734, 0.3926, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.2500, -1.4219, 1.4375, 0.4980, -1.8828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-2.3125, -1.4922, 1.5156, 0.5391, -1.9453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-3.5781, -2.9531, 0.0212, 0.5117, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:55:36,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.43 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.7656, -3.2812, -0.8203, 0.6016, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.5781, -3.1406, -0.9336, 0.7109, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 17:55:37,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 17:55:37,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.63 | bwd_microstep: 1619.65 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 1618.61 | step_microstep: 2.01
[2025-11-06 17:55:37,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.08 | bwd: 1620.60 | bwd_inner: 1.81 | bwd_allreduce: 1618.66 | step: 2.10
11%|█ | 379/3507 [10:51<1:24:26, 1.62s/it] {'loss': 1.5541, 'learning_rate': 1.9683714677499385e-05, 'epoch': 0.11}
tensor([[-1.6484, -0.7852, 2.2344, 0.5977, -1.3516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.0781, -2.3281, 0.9609, 0.8711, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:38,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.17 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10
tensor([[-2.9688, -2.4375, 0.0752, 0.8906, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.5938, -1.7891, 1.3047, 0.5703, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.2812, -3.4219, 0.2471, -0.2852, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.6719, -2.9688, 0.0535, 0.1055, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.5625, -4.0938, -1.6406, 0.1270, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:1')
tensor([[-1.3281, -0.7188, 1.8047, 1.8672, -1.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 17:55:38,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.37 | optimizer_step: 0.35
[2025-11-06 17:55:38,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.47 | bwd_microstep: 69.68 | bwd_inner_microstep: 1.37 | bwd_allreduce_microstep: 68.17 | step_microstep: 3.19
[2025-11-06 17:55:38,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.67 | bwd: 70.74 | bwd_inner: 2.28 | bwd_allreduce: 68.26 | step: 3.29
11%|█ | 380/3507 [10:52<1:06:05, 1.27s/it] {'loss': 1.0088, 'learning_rate': 1.9681405730254078e-05, 'epoch': 0.11}
tensor([[-3.0312, -2.0625, 1.4375, -0.2451, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.6406, -3.1406, -0.6602, 0.8203, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.4375, -3.6875, -0.1895, 0.3828, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.3906, -1.4844, 1.7031, 0.1118, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.5625, -1.7422, 1.5312, 0.4082, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.8594, -2.1094, 1.1250, 0.8867, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:39,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.57 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.3438, -1.8828, 0.1914, 1.1250, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.0625, -4.4375, -1.3594, -0.2432, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 17:55:39,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.22 | optimizer_step: 0.17
[2025-11-06 17:55:39,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.87 | bwd_microstep: 46.26 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 45.16 | step_microstep: 2.07
[2025-11-06 17:55:39,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.45 | bwd: 47.17 | bwd_inner: 1.81 | bwd_allreduce: 45.21 | step: 2.16
11%|█ | 381/3507 [10:53<1:03:57, 1.23s/it] {'loss': 0.3608, 'learning_rate': 1.9679088522156198e-05, 'epoch': 0.11}
tensor([[-2.1719, -1.5469, 0.9844, 0.7500, -1.8516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:39,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.95 | bwd_microstep: 1.23 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-2.6094, -1.8359, 1.2656, 0.5391, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.0938, -2.5156, 0.3105, 1.2266, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.2812, -2.7969, -0.3926, 1.0859, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.5312, -3.9219, -0.9922, 0.1387, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-6.1250, -5.4375, -2.1094, -0.9453, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.6562, -3.1719, -0.6875, 1.0781, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-1.8125, -1.2188, 1.3984, 1.8906, -1.5078]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 17:55:39,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.26
[2025-11-06 17:55:39,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.35 | bwd_microstep: 86.84 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 85.70 | step_microstep: 1.84
[2025-11-06 17:55:39,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 292.33 | bwd: 88.07 | bwd_inner: 2.20 | bwd_allreduce: 85.74 | step: 1.93
11%|█ | 382/3507 [10:53<51:12, 1.02it/s] {'loss': 0.5945, 'learning_rate': 1.967676305518295e-05, 'epoch': 0.11}
tensor([[-3.5000, -2.9375, -0.2988, 0.8672, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.7812, -2.2812, 0.1963, 1.1797, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.5938, -3.8750, -0.4492, -0.0503, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.5938, -2.7812, 0.7617, 0.4590, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-1.4922, -0.9805, 1.1953, 1.6641, -1.2422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.6875, -2.1875, 0.2617, 1.2891, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.1719, -1.6172, 0.9180, 1.4297, -1.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:55:42,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.87 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-1.6250, -1.0781, 1.3203, 1.9453, -1.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:55:42,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 17:55:42,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.71 | bwd_microstep: 2.35 | bwd_inner_microstep: 1.38 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.13
[2025-11-06 17:55:42,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.59 | bwd: 3.38 | bwd_inner: 2.33 | bwd_allreduce: 0.93 | step: 2.22
11%|█ | 383/3507 [10:56<1:14:00, 1.42s/it] {'loss': 0.4756, 'learning_rate': 1.967442933131858e-05, 'epoch': 0.11}
tensor([[-2.6562, -1.8984, 1.0859, 0.6719, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:42,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.37 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-2.5625, -1.9922, 0.4980, 1.0234, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-0.9844, -0.3848, 2.0625, 1.9766, -0.7695]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.6406, -2.8906, 0.3223, 0.6211, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.9375, -4.1250, -0.7305, -0.7617, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.4062, -4.6250, -1.0703, -1.0234, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.2812, -1.6172, 1.0625, 0.8984, -1.9297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.2188, -2.6719, -0.0952, 1.0078, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 17:55:43,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 17:55:43,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.72 | bwd_microstep: 494.16 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 493.12 | step_microstep: 1.67
[2025-11-06 17:55:43,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.12 | bwd: 495.09 | bwd_inner: 1.82 | bwd_allreduce: 493.15 | step: 1.74
11%|█ | 384/3507 [10:57<1:05:19, 1.26s/it] {'loss': 0.8237, 'learning_rate': 1.967208735255439e-05, 'epoch': 0.11}
tensor([[-1.3672, -0.9492, 0.8828, 1.6250, -1.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.0781, -2.4688, 0.2891, 1.1797, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.0000, -3.2656, -0.0254, 0.0732, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.3438, -3.7812, -1.0703, 0.0366, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-0.0659, 0.7266, 3.2500, 1.6328, 0.0918]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.1719, -1.3984, 1.7812, 1.1250, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.5625, -3.7969, -0.2578, -0.3145, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:55:44,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.53 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.0781, -2.6094, -0.4043, 0.9609, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:0')
[2025-11-06 17:55:44,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 17:55:44,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.06 | bwd_microstep: 1.84 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.73 | step_microstep: 1.92
[2025-11-06 17:55:44,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.61 | bwd: 2.75 | bwd_inner: 1.88 | bwd_allreduce: 0.76 | step: 1.99
11%|█ | 385/3507 [10:58<1:04:39, 1.24s/it] {'loss': 0.9255, 'learning_rate': 1.966973712088872e-05, 'epoch': 0.11}
tensor([[-2.9688, -2.3594, 0.5273, 1.5625, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.6875, -1.7734, 1.3984, -0.2461, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([0], device='cuda:0')
[2025-11-06 17:55:44,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.14 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.0938, -3.1562, 0.6562, -0.3457, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.5312, -2.7500, 0.6055, 0.3848, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[0.3125, 0.9258, 3.2656, 2.9688, 0.4141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.0625, -2.3750, 0.6250, 0.6055, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.2812, -2.4062, 1.1953, 0.3008, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.6406, -2.0156, 0.7031, 0.9922, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 17:55:46,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:55:46,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.47 | bwd_microstep: 1353.28 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1351.99 | step_microstep: 1.82
[2025-11-06 17:55:46,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 411.64 | bwd: 1354.18 | bwd_inner: 2.03 | bwd_allreduce: 1352.02 | step: 1.89
11%|█ | 386/3507 [11:00<1:13:23, 1.41s/it] {'loss': 1.0732, 'learning_rate': 1.9667378638326947e-05, 'epoch': 0.11}
tensor([[-4.5938, -3.7031, -0.1553, -0.6367, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.6719, -1.7812, 1.3750, -0.1084, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.8750, -5.1250, -1.5234, -0.6680, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.6406, -1.6797, 1.7344, -0.0234, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-3.4219, -2.6094, 0.5547, -0.2051, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.1719, -2.5000, 0.3457, 0.1152, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.2656, -1.3438, 1.9609, 0.3672, -1.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:46,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.61 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.6094, -2.0625, 0.5078, 1.3438, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:55:47,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.21
[2025-11-06 17:55:47,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.02 | bwd_microstep: 2.00 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.86 | step_microstep: 1.68
[2025-11-06 17:55:47,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.64 | bwd: 3.04 | bwd_inner: 2.03 | bwd_allreduce: 0.89 | step: 1.76
11%|█ | 387/3507 [11:01<1:04:57, 1.25s/it] {'loss': 0.9521, 'learning_rate': 1.9665011906881496e-05, 'epoch': 0.11}
tensor([[-3.2812, -2.6719, 0.1104, 0.6328, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2],
device='cuda:1')
[17:55:47] /github/workspace/src/video/video_reader.cc:83: ERROR opening: /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch23/Worth_Knowing_-_videos_of_America_Palos_Park_Worth.mp4, No such file or directory
Warning: The cache directory for DeepSpeed Triton autotune, /root/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
Error reading /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch23/Worth_Knowing_-_videos_of_America_Palos_Park_Worth.mp4...
sharegpt4v_instruct_gpt4-vision_cap100k
Traceback (most recent call last):
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 718, in __getitem__
    ret=self.video_get_item(data_item)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 604, in video_get_item
    image_list,frame_indices = self.load_video(video_path)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 582, in load_video
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/miniconda3/envs/visualquality/lib/python3.11/site-packages/decord/video_reader.py", line 57, in __init__
    raise RuntimeError("Error reading " + uri + "...")
RuntimeError: Error reading /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch23/Worth_Knowing_-_videos_of_America_Palos_Park_Worth.mp4...
tensor([[-2.7344, -1.9297, 1.2031, 0.5625, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6562, -2.0156, 0.6953, 1.0234, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.6562, -2.1719, 0.1055, 1.2422, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:55:47,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.62 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.0312, -1.1328, 1.8047, 0.4492, -1.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-1.1719, -0.2559, 2.9219, 0.9531, -0.9023]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.2031, -1.5391, 1.3359, 1.7500, -1.8359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.8750, -1.9609, 1.3594, -0.0557, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:55:47,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.24 | optimizer_step: 0.21
[2025-11-06 17:55:47,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.44 | bwd_microstep: 197.52 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 196.32 | step_microstep: 2.08
[2025-11-06 17:55:47,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.08 | bwd: 198.44 | bwd_inner: 1.93 | bwd_allreduce: 196.37 | step: 2.17
11%|█ | 388/3507 [11:01<54:08, 1.04s/it] {'loss': 0.8436, 'learning_rate': 1.9662636928571827e-05, 'epoch': 0.11}
11%|█ | 388/3507 [11:01<54:08, 1.04s/it]tensor([[-3.7656e+00, -3.0625e+00,
-1.5259e-04, 3.8281e-01, -3.2344e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1562, -2.5781, 0.0034, 0.4180, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:47,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.53 | bwd_microstep: 1.33 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-2.4844, -1.6641, 1.5938, 0.7422, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7812, -3.0156, 0.3066, 0.1836, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7656, -2.9062, 0.6211, -0.1924, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6562, -3.0156, -0.2373, 0.1357, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3906, -1.6562, 1.2734, 0.7773, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9844, -3.3281, -0.1992, 0.6797, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:49,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:55:49,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 104.58 | bwd_microstep: 2.06 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 0.83 | step_microstep: 1.87 [2025-11-06 17:55:49,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 250.12 | bwd: 3.39 | bwd_inner: 2.36 | bwd_allreduce: 0.88 | step: 1.98 11%|█ | 389/3507 [11:02<1:00:06, 1.16s/it] {'loss': 0.6716, 'learning_rate': 1.966025370542444e-05, 'epoch': 0.11} 11%|█ | 389/3507 [11:02<1:00:06, 1.16s/it]tensor([[1.7578, 2.0156, 2.8594, 3.3281, 1.7109]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3594e+00, -2.7344e+00, -7.0953e-04, 7.8906e-01, -2.8750e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8438, -3.1250, -0.0425, 0.2148, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.6172, -1.0625, 1.3750, 2.0781, -1.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:49,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.51 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.4219, -1.6172, 1.4219, 0.6484, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4375, -2.8125, 0.0125, 0.5625, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0312, -2.3281, 0.5000, 0.3516, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7812, -3.0625, 0.0938, 0.2949, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:49,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:55:49,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.49 | bwd_microstep: 1.79 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.70 | step_microstep: 1.26 [2025-11-06 17:55:49,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.02 | bwd: 2.71 | bwd_inner: 1.85 | bwd_allreduce: 0.74 | step: 1.35 11%|█ | 390/3507 [11:03<48:44, 1.07it/s] {'loss': 0.7627, 'learning_rate': 1.965786223947287e-05, 'epoch': 0.11} 11%|█ | 390/3507 [11:03<48:44, 1.07it/s]tensor([[-2.4844, -1.7891, 1.1953, 1.1797, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') tensor([[-2.9062, -2.4219, -0.1348, 1.1641, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.5391, -0.7461, 2.2969, 1.1172, -1.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.7539, -0.1377, 2.2500, 2.1094, -0.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5469, -1.9844, 0.5430, 1.3906, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:51,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.51 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.5781, -2.0000, 0.5820, 1.3125, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2656, -2.6719, 0.0718, 0.8398, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6250, -3.8906, -0.6914, -0.5234, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:55:52,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:55:52,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.92 | bwd_microstep: 959.36 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 958.36 | step_microstep: 1.82 [2025-11-06 17:55:52,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.42 | bwd: 960.26 | bwd_inner: 1.68 | bwd_allreduce: 958.42 | step: 1.91 11%|█ | 391/3507 [11:06<1:15:38, 1.46s/it] {'loss': 0.509, 'learning_rate': 1.9655462532757677e-05, 'epoch': 0.11} 11%|█ | 391/3507 [11:06<1:15:38, 1.46s/it]tensor([[-3.7969, -3.2031, -0.4336, 0.6406, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5781, 
-2.9375, -0.0334, 0.4648, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5938, -2.0938, 0.2793, 1.2500, -2.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6719, -2.8438, 0.6289, 0.1797, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:52,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.48 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.0312, -2.2812, 0.7539, 0.7500, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7500, -2.8438, 0.4473, -0.2129, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.6250, -1.6953, 1.5234, 0.0938, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2812, -2.5312, 0.4395, 0.1196, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:52,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 17:55:52,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.66 | bwd_microstep: 1.93 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.83 | step_microstep: 1.68 [2025-11-06 17:55:52,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.18 | bwd: 2.94 | bwd_inner: 1.91 | bwd_allreduce: 0.87 | step: 1.77 11%|█ | 392/3507 [11:06<59:45, 1.15s/it] {'loss': 0.9365, 'learning_rate': 1.965305458732646e-05, 'epoch': 0.11} 11%|█ | 392/3507 [11:06<59:45, 1.15s/it]tensor([[-3.7812, -3.1094, 0.0247, 0.5117, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3125, -4.3125, -0.4688, -1.5859, -4.6250]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-2.6094, -2.1250, 0.0640, 1.0703, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2188, -2.3594, 0.8750, 0.3711, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:52,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.34 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.6250, -3.8438, -0.3203, 0.2295, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6562, -2.9375, 0.3438, 0.8828, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5625, -2.6562, 0.7109, -0.1235, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.5625, -2.1406, -0.1797, 1.0703, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:55:54,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.28 [2025-11-06 17:55:54,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.31 | bwd_microstep: 665.48 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 664.37 | step_microstep: 2.12 [2025-11-06 17:55:54,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.62 | bwd: 666.46 | bwd_inner: 1.90 | bwd_allreduce: 664.42 | step: 2.21 11%|█ | 393/3507 [11:08<1:15:35, 1.46s/it] {'loss': 0.9114, 'learning_rate': 1.9650638405233852e-05, 'epoch': 0.11} 11%|█ | 393/3507 [11:08<1:15:35, 1.46s/it]tensor([[-2.8750, -2.1406, 0.8711, 0.6562, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7969, -2.3281, -0.1338, 1.2969, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:2') tensor([[-3.9688, -3.4219, -0.7734, 0.5586, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6406, -2.2500, -0.4258, 1.2031, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:55,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.74 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.3906, -1.5781, 1.2734, 0.5938, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.2344, -1.4453, 1.6406, 0.8750, -1.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2656, -2.3125, 1.1172, -0.4805, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.8906, -1.2891, 1.0391, 0.7812, -1.5859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:55,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.13 | optimizer_step: 0.18 [2025-11-06 17:55:55,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.75 | bwd_microstep: 1.97 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 1.50 [2025-11-06 17:55:55,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 428.52 | bwd: 3.00 | bwd_inner: 2.09 | bwd_allreduce: 0.78 | step: 1.59 11%|█ | 394/3507 [11:09<1:00:13, 1.16s/it] {'loss': 0.5082, 'learning_rate': 1.96482139885415e-05, 'epoch': 0.11} 11%|█ | 394/3507 [11:09<1:00:13, 1.16s/it]tensor([[-3.4219, -2.9062, -0.4668, 0.8359, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3750, -2.7344, 0.1455, 0.8750, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:55,524] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.57 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.9688, -3.1406, 0.2559, 0.1006, -3.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.3906, -1.9219, 0.1953, 1.3281, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2344, -2.5156, 0.5625, 0.7148, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2812, -5.5000, -1.8750, -0.8203, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -4.1875, -0.7227, -1.1875, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.3594, -1.4766, 1.6484, 0.1465, -1.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:55:57,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.15 | optimizer_step: 0.19 [2025-11-06 17:55:57,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.06 | bwd_microstep: 804.92 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 803.61 | step_microstep: 2.02 [2025-11-06 17:55:57,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.62 | bwd: 805.66 | bwd_inner: 1.89 | bwd_allreduce: 803.65 | step: 2.09 11%|█▏ | 395/3507 [11:11<1:23:31, 1.61s/it] {'loss': 0.5493, 'learning_rate': 1.9645781339318087e-05, 'epoch': 0.11} 11%|█▏ | 395/3507 [11:11<1:23:31, 1.61s/it]tensor([[-2.3438, -1.4922, 1.5938, 0.5508, -1.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4688, -2.8750, -0.2021, 0.6602, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -3.9844, -0.2393, -0.4219, -4.2188]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:58,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.34 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06 tensor([[-3.0625, -2.1406, 1.4844, 0.4336, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7344, -2.8281, 0.8047, -0.0547, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6406, -3.0781, -0.4941, 0.6641, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9219, -2.0156, 1.2578, 0.4473, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2344, -2.5312, 0.4082, 0.6953, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:55:58,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 17:55:58,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.84 | bwd_microstep: 57.16 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 55.98 | step_microstep: 1.75 [2025-11-06 17:55:58,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 438.21 | bwd: 58.03 | bwd_inner: 1.90 | bwd_allreduce: 56.01 | step: 1.82 11%|█▏ | 396/3507 [11:12<1:06:47, 1.29s/it] {'loss': 0.4348, 'learning_rate': 1.9643340459639327e-05, 'epoch': 0.11} 11%|█▏ | 396/3507 [11:12<1:06:47, 1.29s/it]tensor([[-3.5781, -3.0938, -0.7852, 0.8945, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1562, -3.1562, 0.4668, -0.5703, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2656, -1.5234, 1.3516, 0.9180, -1.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:1') tensor([[-2.0156, -1.2266, 1.7188, 1.1797, -1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:58,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.48 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10 tensor([[-2.4531, -1.8281, 0.8984, 1.8828, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2656, -2.7500, -0.6172, 0.5781, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.7734, -0.9648, 1.9844, 0.5664, -1.4766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, -4.1875, -0.3926, -0.7227, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:59,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.73 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 17:55:59,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 229.18 | bwd_microstep: 1.66 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.75 [2025-11-06 17:55:59,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 476.68 | bwd: 2.60 | bwd_inner: 1.63 | bwd_allreduce: 0.84 | step: 2.85 11%|█▏ | 397/3507 [11:13<58:50, 1.14s/it] {'loss': 0.4562, 'learning_rate': 1.9640891351587946e-05, 'epoch': 0.11} 11%|█▏ | 397/3507 [11:13<58:50, 1.14s/it]tensor([[-3.4375, -2.4375, 1.1875, -0.3086, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0000, -1.1719, 1.5234, 0.1963, -1.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6406, -2.9844, -0.1592, 0.3574, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:59,508] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.48 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.8438, -1.9688, 1.4219, 0.5000, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7812, -2.1875, 0.5352, 1.4141, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.9141, -1.0156, 2.1406, 0.9141, -1.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4688, -2.6250, 0.8516, 0.3789, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4844, -2.6875, 0.6719, 0.3477, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:55:59,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 17:55:59,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 283.79 | bwd_microstep: 1.64 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.73 | step_microstep: 2.03 [2025-11-06 17:55:59,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 457.31 | bwd: 2.40 | bwd_inner: 1.46 | bwd_allreduce: 0.77 | step: 2.12 11%|█▏ | 398/3507 [11:13<49:05, 1.06it/s] {'loss': 0.4783, 'learning_rate': 1.9638434017253693e-05, 'epoch': 0.11} 11%|█▏ | 398/3507 [11:13<49:05, 1.06it/s]tensor([[-3.0938, -2.3906, 0.6719, 1.2656, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:55:59,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.07 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.3125, -4.6562, -1.6094, -0.3516, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
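The interleaved `tensor([[...]]) tensor([k])` pairs in this log appear to be per-sample debug prints of a 5-way classification head's logits next to the ground-truth label (reading the head width off the printed tensor shape is an assumption; the training script itself is not shown here). A minimal sketch of how such a pair relates, using the values from the debug print just above and a hypothetical `predicted_class` helper:

```python
# Hedged sketch: interpret one printed (logits, label) pair from the log.
# The 5 logit values and the label 3 are copied from the cuda:3 debug print
# above; predicted_class is an illustrative helper, not the script's code.

def predicted_class(logits):
    """Index of the largest logit, i.e. the head's argmax prediction."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [-5.3125, -4.6562, -1.6094, -0.3516, -4.6250]
label = 3

pred = predicted_class(logits)
print(pred, pred == label)  # → 3 True
```

Under this reading, the largest logit (-0.3516, index 3) matches the printed label, so that particular sample is classified correctly; many other pairs in the log (e.g. an argmax of 2 against a label of 3) are not.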
tensor([[-3.7031, -2.7344, 0.6484, -0.6094, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.3125, -6.5938, -3.0781, -1.2500, -6.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.0625, -2.5000, -0.0197, 0.8789, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5000, -1.9375, 0.5312, 1.3984, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9844, -2.0312, 1.3906, 0.0488, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9062, -3.3281, -0.5938, 0.5820, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:56:02,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 17:56:02,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.32 | bwd_microstep: 895.03 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 893.99 | step_microstep: 2.41 [2025-11-06 17:56:02,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.41 | bwd: 895.69 | bwd_inner: 1.51 | bwd_allreduce: 894.03 | step: 2.49 11%|█▏ | 399/3507 [11:16<1:23:25, 1.61s/it] {'loss': 0.3271, 'learning_rate': 1.9635968458733338e-05, 'epoch': 0.11} 11%|█▏ | 399/3507 [11:16<1:23:25, 1.61s/it]tensor([[-2.8594, -2.0312, 1.3359, 0.8438, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1875, -3.4844, -0.6562, -0.4766, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:56:03,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.46 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.7500, -2.9375, 0.3496, 
0.3652, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0625, -4.2500, -0.7852, -0.2969, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.8281, -2.2500, 0.2090, 0.9922, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.8125, -2.0312, 1.1719, 0.5156, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0000, -3.0312, 0.6914, -0.3281, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.9062, -2.1562, 0.7773, 0.6172, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:56:03,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 17:56:03,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.41 | bwd_microstep: 544.39 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 543.33 | step_microstep: 1.62
[2025-11-06 17:56:03,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.89 | bwd: 545.31 | bwd_inner: 1.81 | bwd_allreduce: 543.38 | step: 1.70
11%|█▏ | 400/3507 [11:17<1:12:48, 1.41s/it] {'loss': 0.5459, 'learning_rate': 1.9633494678130666e-05, 'epoch': 0.11}
11%|█▏ | 400/3507 [11:17<1:12:48, 1.41s/it]
tensor([[-3.5156, -2.9375, -0.2295, 1.1484, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.9180, -0.1387, 2.4844, 1.5625, -0.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.0000, -2.4375, -0.0291, 0.9531, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:56:04,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.04 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.6250, -2.0312, 0.5391, 1.4375, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.5938, -1.6875, 1.7344, 0.5938, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.1875e+00, -2.2344e+00, 1.2656e+00, 6.6376e-04, -2.7188e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.7188, -1.8984, 1.2266, 0.3320, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.2812, -1.7266, 0.6055, 0.8242, -1.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:56:05,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.77 | optimizer_gradients: 0.19 | optimizer_step: 0.24
[2025-11-06 17:56:05,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.65 | bwd_microstep: 1367.41 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1366.21 | step_microstep: 2.75
[2025-11-06 17:56:05,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.71 | bwd: 1368.50 | bwd_inner: 2.11 | bwd_allreduce: 1366.26 | step: 2.82
11%|█▏ | 401/3507 [11:19<1:18:32, 1.52s/it] {'loss': 0.7529, 'learning_rate': 1.963101267755648e-05, 'epoch': 0.11}
11%|█▏ | 401/3507 [11:19<1:18:32, 1.52s/it]
tensor([[-3.2188, -2.4844, 0.5195, 0.7852, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.3281, -1.7578, 0.6250, 1.4531, -1.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.1094, -2.3750, 0.5430, 0.6953, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6719, -2.8594, 0.3828, 0.1309, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:56:05,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.76 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.22 | step_microstep: 0.09
tensor([[-3.4219, -2.9375, -0.7734, 0.9180, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.9062, -2.2188, 0.7383, 1.1562, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0000, -2.5000, -0.1328, 1.4453, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.5391, -0.6523, 2.1719, 1.2031, -1.2422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 17:56:07,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.16 | optimizer_step: 0.21
[2025-11-06 17:56:07,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.31 | bwd_microstep: 2.12 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.98 | step_microstep: 2.61
[2025-11-06 17:56:07,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 380.08 | bwd: 3.19 | bwd_inner: 1.83 | bwd_allreduce: 1.20 | step: 2.70
11%|█▏ | 402/3507 [11:21<1:21:08, 1.57s/it] {'loss': 0.8545, 'learning_rate': 1.96285224591286e-05, 'epoch': 0.11}
11%|█▏ | 402/3507 [11:21<1:21:08, 1.57s/it]
tensor([[-2.4219, -1.8438, 0.5391, 1.2188, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[2.7344, 3.3594, 5.2188, 3.6719, 2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3125, -3.7188, -0.9219, 0.3594, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8594, -2.3750, -0.2002, 1.3359, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:56:07,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.51 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.8750, -1.8906, 1.4609, 0.1309, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2656, -2.4688, 0.4941, 0.2988, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.6797, -1.0312, 1.6172, 1.9609, -1.3672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.6094, -0.9961, 1.5391, 2.1094, -1.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:56:08,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 17:56:08,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.86 | bwd_microstep: 602.62 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 601.85 | step_microstep: 1.48
[2025-11-06 17:56:08,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.38 | bwd: 603.53 | bwd_inner: 1.49 | bwd_allreduce: 601.90 | step: 1.57
11%|█▏ | 403/3507 [11:22<1:12:55, 1.41s/it] {'loss': 0.5024, 'learning_rate': 1.962602402497185e-05, 'epoch': 0.11}
11%|█▏ | 403/3507 [11:22<1:12:55, 1.41s/it]
tensor([[-1.9297, -1.1172, 1.8359, 1.1328, -1.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2969, -2.5156, 0.6992, 0.7617, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.3125, -2.5156, 0.7656, 0.8242, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:56:08,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.36 | bwd_microstep: 0.64 | bwd_inner_microstep: 0.54 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.3828, -0.7422, 1.7266, 1.1484, -1.1328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.9531, -3.1250, 0.3574, 0.7031, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.3438, -1.7109, 0.8906, 1.4375, -1.9766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.3750, -0.6641, 1.9766, 1.7656, -1.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2031, -2.6719, -0.3047, 1.2422, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:56:09,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 17:56:09,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.09 | bwd_microstep: 218.24 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 217.41 | step_microstep: 1.58
[2025-11-06 17:56:09,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 435.48 | bwd: 218.88 | bwd_inner: 1.31 | bwd_allreduce: 217.45 | step: 1.66
12%|█▏ | 404/3507 [11:22<1:01:45, 1.19s/it] {'loss': 0.5947, 'learning_rate': 1.9623517377218072e-05, 'epoch': 0.12}
12%|█▏ | 404/3507 [11:22<1:01:45, 1.19s/it]
tensor([[-2.7812, -1.7812, 1.5547, -0.0708, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:56:09,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.37 | bwd_microstep: 0.64 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.9062, -3.2031, -0.1836, 0.7031, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2031, -2.3594, 1.0312, 0.7500, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2500, -3.6562, -0.9688, 0.5000, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6875, -2.8125, 0.7695, 0.5469, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3438, -4.5312, -1.0312, -0.6797, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.1406, -2.2031, 0.8984, -0.0623, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.4375, -3.5469, 0.0051, -0.0942, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:56:10,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 17:56:10,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.92 | bwd_microstep: 71.34 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 70.48 | step_microstep: 2.19
[2025-11-06 17:56:10,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.29 | bwd: 71.98 | bwd_inner: 1.34 | bwd_allreduce: 70.51 | step: 2.26
12%|█▏ | 405/3507 [11:24<1:01:06, 1.18s/it] {'loss': 0.4905, 'learning_rate': 1.9621002518006115e-05, 'epoch': 0.12}
12%|█▏ | 405/3507 [11:24<1:01:06, 1.18s/it]
tensor([[-2.6719, -1.9297, 1.1797, 1.5234, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.1562, -2.4688, 0.3613, 0.7344, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.8750, -1.2266, 1.3828, 1.8594, -1.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.5938, -2.9062, 0.0457, 0.7148, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:56:10,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.53 | bwd_microstep: 0.60 | bwd_inner_microstep: 0.51 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-1.7812, -0.9609, 1.7031, 1.0312, -1.4609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.1719, -2.6094, -0.2812, 1.1641, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-2.5312, -1.8906, 0.6953, 1.2344, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.5156, -2.5938, 0.8789, -0.0060, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:56:12,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.68 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:56:12,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.45 | bwd_microstep: 781.00 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 780.11 | step_microstep: 2.22
[2025-11-06 17:56:12,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.01 | bwd: 781.60 | bwd_inner: 1.33 | bwd_allreduce: 780.14 | step: 2.29
12%|█▏ | 406/3507 [11:26<1:13:16, 1.42s/it] {'loss': 1.0601, 'learning_rate': 1.9618479449481826e-05, 'epoch': 0.12}
12%|█▏ | 406/3507 [11:26<1:13:16, 1.42s/it]
tensor([[-2.9062, -2.1250, 0.7930, 0.4785, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[0.8828, 1.4766, 3.5938, 3.5625, 0.9297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:56:12,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.40 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.5938, -1.7422, 1.4375, 0.5781, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8281, -2.9688, 0.4941, 0.1348, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[0.5078, 1.0469, 3.0312, 3.4062, 0.5820]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.5078, -0.8438, 1.7031, 1.9531, -1.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.3594, -2.5938, 0.5469, 0.9492, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.8281, -1.9844, 1.1016, 0.3359, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:56:13,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 17:56:13,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.06 | bwd_microstep: 601.45 | bwd_inner_microstep: 1.28 | bwd_allreduce_microstep: 600.05 | step_microstep: 2.38
[2025-11-06 17:56:13,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 388.49 | bwd: 602.32 | bwd_inner: 2.07 | bwd_allreduce: 600.08 | step: 2.46
12%|█▏ | 407/3507 [11:27<1:07:13, 1.30s/it] {'loss': 0.5886, 'learning_rate': 1.9615948173798073e-05, 'epoch': 0.12}
12%|█▏ | 407/3507 [11:27<1:07:13, 1.30s/it]
tensor([[-2.5156, -1.5938, 1.4844, 0.3477, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.7344, -2.2344, -0.0148, 1.5312, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-2.0312, -1.1719, 2.0000, 1.2188, -1.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:56:13,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.15 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.4688, -0.6016, 2.0938, 0.8125, -1.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-2.5469, -1.6562, 1.6016, 1.0156, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.6875, -2.0938, 0.4629, 1.5078, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0312, -2.4062, 0.3691, 1.4141, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.2656, -2.3906, 1.0000, -0.0713, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:56:14,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.88 | optimizer_gradients: 0.23 | optimizer_step: 0.20
[2025-11-06 17:56:14,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.14 | bwd_microstep: 397.82 | bwd_inner_microstep: 1.56 | bwd_allreduce_microstep: 396.16 | step_microstep: 2.99
[2025-11-06 17:56:14,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 380.31 | bwd: 398.73 | bwd_inner: 2.37 | bwd_allreduce: 396.20 | step: 3.07
12%|█▏ | 408/3507 [11:28<1:13:59, 1.43s/it] {'loss': 1.25, 'learning_rate': 1.9613408693114707e-05, 'epoch': 0.12}
12%|█▏ | 408/3507 [11:28<1:13:59, 1.43s/it]
tensor([[-3.2812, -2.7188, -0.3945, 0.6445, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6875, -2.8906, 0.1729, -0.0454, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4219, -2.6562, 0.3984, 0.5703, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:56:15,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.08 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-2.7188, -1.7891, 1.6406, 0.3457, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9688, -3.1562, 0.2812, 0.4707, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.2266, -0.7461, 1.0703, 2.1406, -0.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-2.5625, -1.5938, 1.6172, -0.0713, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.0469, -2.1094, 1.2578, -0.1172, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:56:16,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.34 | optimizer_step: 0.35
[2025-11-06 17:56:16,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.90 | bwd_microstep: 816.86 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 815.69 | step_microstep: 2.96
[2025-11-06 17:56:16,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.99 | bwd: 817.70 | bwd_inner: 1.82 | bwd_allreduce: 815.73 | step: 3.08
12%|█▏ | 409/3507 [11:29<1:10:05, 1.36s/it] {'loss': 0.8434, 'learning_rate': 1.96108610095986e-05, 'epoch': 0.12}
12%|█▏ | 409/3507 [11:29<1:10:05, 1.36s/it]
tensor([[-3.1406, -2.6562, -0.5352, 0.8242, -2.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.3750, -1.3828, 1.8984, 0.1299, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:56:16,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.85 | bwd_microstep: 1.51 | bwd_inner_microstep: 1.32 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.14
tensor([[-3.2656, -2.6562, -0.0615, 1.2031, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.6016, 0.0630, 2.4219, 2.0469, -0.4141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.9219, -2.1562, 0.7500, 0.6719, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.4375, -1.5391, 1.8984, 0.9531, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.5625, -1.8047, 1.0391, 1.1484, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.3125, -1.4375, 1.5625, 0.4668, -1.9453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:56:18,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:56:18,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.90 | bwd_microstep: 1335.30 | bwd_inner_microstep: 2.00 | bwd_allreduce_microstep: 1333.12 | step_microstep: 1.65
[2025-11-06 17:56:18,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.75 | bwd: 1336.78 | bwd_inner: 3.35 | bwd_allreduce: 1333.17 | step: 1.79
12%|█▏ | 410/3507 [11:31<1:19:15, 1.54s/it] {'loss': 0.5173, 'learning_rate': 1.9608305125423608e-05, 'epoch': 0.12}
12%|█▏ | 410/3507 [11:31<1:19:15, 1.54s/it]
tensor([[-1.8984, -1.1172, 1.4297, 0.9531, -1.5547]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.0625, -5.3750, -2.1875, -1.1094, -5.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:56:18,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.84 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.2812, -3.3438, 0.4121, -0.0825, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0000, -3.5156, -1.1875, 0.3594, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.9375, -1.9922, 1.3984, 0.1592, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5000, -3.0156, -0.7617, 0.8359, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.4062, -2.7344, 0.1001, 1.0703, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.9531, -3.1562, 0.0208, 0.5703, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:56:19,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.08 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 17:56:19,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.81 | bwd_microstep: 959.92 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 958.78 | step_microstep: 2.96
[2025-11-06 17:56:19,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.68 | bwd: 960.87 | bwd_inner: 1.92 | bwd_allreduce: 958.82 | step: 3.03
12%|█▏ | 411/3507 [11:33<1:16:41, 1.49s/it] {'loss': 0.3787, 'learning_rate': 1.960574104277059e-05, 'epoch': 0.12}
12%|█▏ | 411/3507 [11:33<1:16:41, 1.49s/it]
tensor([[-2.3281, -1.4219, 1.6484, 0.4648, -1.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:56:19,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 110.16 | bwd_microstep: 1.28 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-3.5312, -2.5625, 0.9961, 0.1074, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.6406, -2.1719, -0.1816, 0.8398, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.6875, -1.8438, 1.3906, 0.7188, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.3750, -2.6094, 0.5820, 1.1094, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.0938, -2.6094, -0.4453, 0.8906, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8906, -3.2969, -0.6445, 0.7266, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7188, -3.0469, -0.2559, 0.2930, -3.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:56:20,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.16 | optimizer_step: 0.24
[2025-11-06 17:56:20,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.42 | bwd_microstep: 232.51 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 231.46 | step_microstep: 2.25
[2025-11-06 17:56:20,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.61 | bwd: 233.78 | bwd_inner: 2.10 | bwd_allreduce: 231.51 | step: 2.34
12%|█▏ | 412/3507 [11:33<1:02:58, 1.22s/it] {'loss': 0.3928, 'learning_rate': 1.9603168763827405e-05, 'epoch': 0.12}
12%|█▏ | 412/3507 [11:33<1:02:58, 1.22s/it]
tensor([[-4.8125, -4.0312, -0.7773, -0.2812, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:56:20,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.23 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10
tensor([[-4.1562, -3.5625, -0.8398, 0.8047, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.1562, -1.3672, 1.4141, 0.7695, -1.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.2188, -1.5469, 1.0547, 1.2266, -1.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.7188, -1.8906, 1.2422, 1.1406, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.7031, -2.0312, 0.7188, 1.0781, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.8594, -2.9062, 0.8711, 0.0569, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.0156, -0.4609, 1.7500, 2.6094, -0.7539]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:56:22,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.17 | optimizer_step: 0.21
[2025-11-06 17:56:22,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.31 | bwd_microstep: 2010.95 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 2009.70 | step_microstep: 2.24
[2025-11-06 17:56:22,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.56 | bwd: 2011.92 | bwd_inner: 1.95 | bwd_allreduce: 2009.78 | step: 2.34
12%|█▏ | 413/3507 [11:36<1:21:04, 1.57s/it] {'loss': 0.5625, 'learning_rate': 1.9600588290788898e-05, 'epoch': 0.12}
12%|█▏ | 413/3507 [11:36<1:21:04, 1.57s/it]
tensor([[-3.7344, -3.1562, -0.6055, 0.6250, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8438, -2.9844, 0.3672, 0.0757, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.2812, -6.3438, -2.4062, -2.0312, -6.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:56:22,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.38 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.1094, -1.5234, 0.8867, 1.9141, -1.7266]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.1406, -2.5156, 0.0796, 0.7500, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.7500, -2.0938, 0.6328, 1.5547, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1875, -3.5312, -0.5898, 0.3672, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3438, -3.6875, -0.8164, 0.2441, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:56:22,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.20 | optimizer_step: 0.24
[2025-11-06 17:56:22,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.93 | bwd_microstep: 2.60 | bwd_inner_microstep: 1.47 | bwd_allreduce_microstep: 1.03 | step_microstep: 1.89
[2025-11-06 17:56:22,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.32 | bwd: 3.36 | bwd_inner: 2.14 | bwd_allreduce: 1.06 | step: 1.97
12%|█▏ | 414/3507 [11:36<1:03:08, 1.23s/it] {'loss': 0.4653, 'learning_rate': 1.959799962585691e-05, 'epoch': 0.12}
12%|█▏ | 414/3507 [11:36<1:03:08, 1.23s/it]
tensor([[-3.7656, -2.8281, 0.5742, -0.6641, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.4219, -2.5781, 0.7812, 0.5703, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4062, -3.7656, -0.9609, 0.1592, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:56:23,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.35 | bwd_microstep: 1.11 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-1.9141, -0.9336, 2.2812, 0.1807, -1.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4219, -2.5469, 0.8320, 0.4609, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.8906, -2.2656, 0.3301, 1.3828, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8594, -3.3125, -0.8633, 0.6172, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.7812, -5.8125, -1.6016, -1.0547, -5.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:56:24,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 17:56:24,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.65 | bwd_microstep: 772.70 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 771.41 | step_microstep: 3.19
[2025-11-06 17:56:24,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.00 | bwd: 773.81 | bwd_inner: 2.23 | bwd_allreduce: 771.45 | step: 3.29
12%|█▏ | 415/3507 [11:37<1:02:25, 1.21s/it] {'loss': 0.4977, 'learning_rate': 1.959540277124027e-05, 'epoch': 0.12}
12%|█▏ | 415/3507 [11:37<1:02:25, 1.21s/it]
tensor([[-3.7969, -3.2656, -0.9023, 0.7266, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6094, -1.7266, 1.4766, 0.5508, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.4844, -1.6250, 1.3359, 0.5625, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:56:24,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.27 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.3594, -1.5312, 1.5703, 0.8516, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.8438, -2.2812, 0.1445, 1.1875, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.4766, -0.4941, 2.6094, 0.9922, -1.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.5312, -1.6875, 1.3281, 1.1484, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6562, -3.8125, -0.3770, -0.3418, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:56:25,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.84 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 17:56:25,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.50 | bwd_microstep: 3.71 | bwd_inner_microstep: 2.75 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.65
[2025-11-06 17:56:25,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 370.73 | bwd: 4.77 | bwd_inner: 3.71 | bwd_allreduce: 0.89 | step: 2.73
12%|█▏ | 416/3507 [11:38<1:00:23, 1.17s/it] {'loss': 0.4635, 'learning_rate': 1.9592797729154796e-05, 'epoch': 0.12}
12%|█▏ | 416/3507 [11:38<1:00:23, 1.17s/it]
tensor([[-6.3750, -5.4375, -1.5625, -1.4141, -5.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:56:25,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.19 | bwd_microstep: 1.75 | bwd_inner_microstep: 1.49 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.17
tensor([[-4.7812, -3.9219, -0.4414, -0.0762, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.7812, -1.7188, 1.5469, -0.1758, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.8125, -2.7812, 0.9609, -0.4434, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.4688, -1.8594, 0.6016, 1.2891, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.5859, -0.8711, 1.9375, 1.7422, -1.2734]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3750, -4.5000, -0.7344, -0.2832, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.4062, -2.6719, 0.3887, 0.6680, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 17:56:26,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 17:56:26,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.61 | bwd_microstep: 1082.57 | bwd_inner_microstep: 2.25 | bwd_allreduce_microstep: 1080.10 | step_microstep: 2.76
[2025-11-06 17:56:26,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 469.86 | bwd: 1084.29 | bwd_inner: 3.77 | bwd_allreduce: 1080.18 | step: 2.94
12%|█▏ | 417/3507 [11:40<1:07:06, 1.30s/it] {'loss': 0.9431, 'learning_rate': 1.959018450182329e-05, 'epoch': 0.12}
12%|█▏ | 417/3507 [11:40<1:07:06, 1.30s/it]
tensor([[-4.1562, -3.4375, -0.3809, 0.3770, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.9062, -1.1875, 1.3125, 0.7539, -1.5703]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7500, -3.2031, -0.8086, 0.6641, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.1250, -0.7227, 0.9648, 2.5000, -0.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:56:27,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.39 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-1.7734, -0.9492, 1.5469, 0.0957, -1.4766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.1875, -1.5000, 0.9883, 1.1875, -1.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.4219, -1.9219, 0.2285, 1.3438, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.9375, -2.3125, 0.2188, 1.1641, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:56:27,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.15 | optimizer_step: 0.19
[2025-11-06 17:56:27,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.18 | bwd_microstep: 7.47 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 6.18 | step_microstep: 1.65
[2025-11-06 17:56:27,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.60 | bwd: 8.28 | bwd_inner: 1.94 | bwd_allreduce: 6.21 | step: 1.73
12%|█▏ | 418/3507 [11:41<53:19, 1.04s/it] {'loss': 0.5017, 'learning_rate': 1.958756309147555e-05, 'epoch': 0.12}
12%|█▏ | 418/3507 [11:41<53:19, 1.04s/it]
tensor([[-2.3281, -1.8516, 0.2793, 1.9609, -1.8984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.8125, -6.1875, -3.1719, -1.1484, -5.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.3281, -1.3750, 1.9219, 0.5078, -1.9453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6875, -3.7656, -0.1582, -0.3184, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.6719, -1.7734, 1.3906, 0.4707, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:56:27,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.33 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.5781, -2.6406, 1.1250, 0.4746, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.9883, -0.3359, 2.0469, 2.2969, -0.7227]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.0938, -2.0312, 1.6016, -0.1602, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:56:29,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.69 | optimizer_gradients: 0.20 | optimizer_step: 0.20
[2025-11-06 17:56:29,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.06 | bwd_microstep: 1259.69 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1258.45 | step_microstep: 3.85
[2025-11-06 17:56:29,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.41 | bwd: 1260.60 | bwd_inner: 1.96 | bwd_allreduce: 1258.50 | step: 3.94
12%|█▏ | 419/3507 [11:43<1:10:26, 1.37s/it] {'loss': 0.3733, 'learning_rate': 1.958493350034834e-05, 'epoch': 0.12}
12%|█▏ | 419/3507 [11:43<1:10:26, 1.37s/it]
tensor([[-4.4062, -3.6094, -0.3965, -0.2412, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:56:29,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.90 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.2031, -1.5469, 1.0000, 1.4297, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4844, -2.7656, 0.2178, 0.7500, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.5000, -1.8125, 0.9570, 1.4141, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6250, -3.6875, 0.1777, -0.0154, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.8125, -2.9844, 0.5312, 0.6719, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.9844, -1.2734, 1.1719, 0.6445, -1.6484]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3750, -3.7031, -0.7266, 0.3496, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06
17:56:29,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 17:56:29,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.37 | bwd_microstep: 81.68 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 80.65 | step_microstep: 1.98 [2025-11-06 17:56:29,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 298.28 | bwd: 82.66 | bwd_inner: 1.81 | bwd_allreduce: 80.69 | step: 2.08 12%|█▏ | 420/3507 [11:43<55:45, 1.08s/it] {'loss': 0.668, 'learning_rate': 1.9582295730685406e-05, 'epoch': 0.12} 12%|█▏ | 420/3507 [11:43<55:45, 1.08s/it]tensor([[-5.2812, -4.4688, -0.9023, -0.4141, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8281, -3.1094, 0.0613, 0.7578, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:56:29,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.33 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.4062, -3.5312, 0.0505, 0.0845, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5781, -2.0781, 0.0566, 1.4297, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4375, -3.7500, -0.6406, 0.2158, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8906, -1.0312, 1.6172, 0.6797, -1.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.7188, -5.6562, -1.4453, -1.9922, -5.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [h264 @ 0xcf77ec0] mmco: unref short failure tensor([[-1.4062, -0.5977, 2.0938, 1.4922, -1.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 
17:56:32,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.41 | optimizer_step: 0.57 [2025-11-06 17:56:32,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.29 | bwd_microstep: 2093.48 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 2092.34 | step_microstep: 4.30 [2025-11-06 17:56:32,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.64 | bwd: 2094.25 | bwd_inner: 1.64 | bwd_allreduce: 2092.43 | step: 4.38 12%|█▏ | 421/3507 [11:46<1:21:00, 1.57s/it] {'loss': 0.8784, 'learning_rate': 1.9579649784737484e-05, 'epoch': 0.12} 12%|█▏ | 421/3507 [11:46<1:21:00, 1.57s/it]tensor([[-3.2656, -2.6875, -0.1025, 1.0625, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8125, -3.2969, -0.9766, 0.9375, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1094, -2.5469, -0.1143, 1.1719, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:56:32,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.42 | bwd_microstep: 1.88 | bwd_inner_microstep: 1.60 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.19 tensor([[-2.1562, -1.4531, 0.9297, 0.7266, -1.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0625, -4.1875, -0.5977, -0.5703, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([[-3.8281, -3.0156, 0.3711, 0.8008, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-2.7812, -2.1406, 0.4707, 0.9922, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([3], device='cuda:1') tensor([3], device='cuda:0') tensor([[1.1250, 1.8594, 3.9219, 2.6562, 1.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:56:32,911] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.22 | optimizer_step: 0.25 [2025-11-06 17:56:32,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.43 | bwd_microstep: 32.50 | bwd_inner_microstep: 1.75 | bwd_allreduce_microstep: 30.61 | step_microstep: 2.16 [2025-11-06 17:56:32,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.86 | bwd: 34.38 | bwd_inner: 3.37 | bwd_allreduce: 30.71 | step: 2.35 12%|█▏ | 422/3507 [11:46<1:03:31, 1.24s/it] {'loss': 0.4596, 'learning_rate': 1.957699566476228e-05, 'epoch': 0.12} 12%|█▏ | 422/3507 [11:46<1:03:31, 1.24s/it]tensor([[-4.1562, -3.3750, -0.3691, -0.1914, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.9844, -2.2969, 0.5156, 1.2656, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:56:33,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.84 | bwd_microstep: 1.65 | bwd_inner_microstep: 1.45 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.15 tensor([[-2.7188, -2.1719, 0.1089, 1.0078, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2812, -5.2812, -1.0391, -0.9805, -5.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1719, -2.3125, 1.0312, 0.5391, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3125, -2.6250, 0.1631, 1.0391, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6719, -2.8594, 0.5430, 0.9805, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.8750, -0.8359, 2.6094, 0.5391, -1.5391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:56:34,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 1.50 | optimizer_gradients: 0.17 | optimizer_step: 0.21 [2025-11-06 17:56:34,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.87 | bwd_microstep: 1185.13 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 1183.97 | step_microstep: 4.40 [2025-11-06 17:56:34,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 447.56 | bwd: 1186.77 | bwd_inner: 2.55 | bwd_allreduce: 1184.05 | step: 4.54 12%|█▏ | 423/3507 [11:48<1:15:07, 1.46s/it] {'loss': 0.4961, 'learning_rate': 1.9574333373024474e-05, 'epoch': 0.12} 12%|█▏ | 423/3507 [11:48<1:15:07, 1.46s/it]tensor([[-3.0000, -2.3125, 0.3887, 0.8320, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.6328, -1.1172, 0.8633, 2.1719, -1.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 17:56:35,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.96 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.08 tensor([[-3.2812, -2.6719, -0.0092, 1.0938, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9062, -3.0312, 0.5508, 0.4727, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9375, -3.0625, 0.5547, 0.5625, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7500, -4.2188, -1.6328, 0.7305, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9531, -3.1250, 0.2168, 0.2031, -3.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.0312, -2.0000, 1.3828, -0.2715, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:56:35,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.86 | 
optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 17:56:35,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.28 | bwd_microstep: 97.94 | bwd_inner_microstep: 3.31 | bwd_allreduce_microstep: 94.48 | step_microstep: 4.65 [2025-11-06 17:56:35,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.28 | bwd: 98.84 | bwd_inner: 4.16 | bwd_allreduce: 94.50 | step: 4.73 12%|█▏ | 424/3507 [11:49<1:00:52, 1.18s/it] {'loss': 0.8868, 'learning_rate': 1.9571662911795718e-05, 'epoch': 0.12} 12%|█▏ | 424/3507 [11:49<1:00:52, 1.18s/it]tensor([[-1.3750, -0.7109, 1.8594, 2.5469, -1.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.9570, -0.0596, 2.6250, 1.3906, -0.6992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:56:35,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.51 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-3.2656, -2.7656, -0.6133, 1.2344, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5469, -1.7578, 1.0312, 0.7969, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1406, -2.0469, 1.5938, -0.2871, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.1562, -2.1719, 0.9258, 0.1572, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-1.0781, -0.4473, 1.6094, 1.6094, -0.8203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5938, -3.8438, -0.5312, 0.8008, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:56:36,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.20 | optimizer_step: 0.19 
[2025-11-06 17:56:36,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.46 | bwd_microstep: 894.20 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 892.98 | step_microstep: 2.87 [2025-11-06 17:56:36,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 345.96 | bwd: 895.22 | bwd_inner: 1.99 | bwd_allreduce: 893.04 | step: 2.99 12%|█▏ | 425/3507 [11:50<1:02:30, 1.22s/it] {'loss': 1.3473, 'learning_rate': 1.9568984283354637e-05, 'epoch': 0.12} 12%|█▏ | 425/3507 [11:50<1:02:30, 1.22s/it]tensor([[-2.4375, -1.9453, 0.0625, 1.3828, -1.9922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3438, -2.3125, 1.3594, -0.0679, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:56:36,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.61 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.7031, -1.6875, 1.7891, -0.1069, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -3.8906, -0.2295, 0.1934, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8750, -1.9609, 1.1406, 0.0143, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0312, -3.4844, -1.0234, 0.6836, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.1250, -3.2031, 0.5938, 0.1592, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.7422, -0.8633, 2.0469, 1.1094, -1.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:56:37,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 17:56:37,238] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.63 | bwd_microstep: 112.17 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 110.94 | step_microstep: 3.34 [2025-11-06 17:56:37,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.27 | bwd: 113.13 | bwd_inner: 1.89 | bwd_allreduce: 111.02 | step: 3.44 12%|█▏ | 426/3507 [11:51<51:37, 1.01s/it] {'loss': 0.855, 'learning_rate': 1.9566297489986826e-05, 'epoch': 0.12} 12%|█▏ | 426/3507 [11:51<51:37, 1.01s/it]tensor([[-3.4531, -2.6875, 0.3926, 1.0703, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:56:37,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.88 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.3125, -2.4375, 1.0625, 0.7500, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9531, -2.2812, 0.3477, 1.0547, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7656, -2.9375, 0.4453, 0.5547, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9531, -3.3594, -0.6758, 0.7383, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3906, -2.3125, 1.3828, -0.1777, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9062, -4.0000, -0.1797, -0.2080, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.3438, -3.5625, -0.4082, 0.0454, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:56:40,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 17:56:40,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 205.24 | bwd_microstep: 2277.39 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 2276.21 | step_microstep: 2.06 [2025-11-06 17:56:40,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.14 | bwd: 2278.36 | bwd_inner: 1.94 | bwd_allreduce: 2276.26 | step: 2.15 12%|█▏ | 427/3507 [11:53<1:19:53, 1.56s/it] {'loss': 0.5729, 'learning_rate': 1.9563602533984843e-05, 'epoch': 0.12} 12%|█▏ | 427/3507 [11:53<1:19:53, 1.56s/it]tensor([[-2.1875, -1.6562, 0.5508, 1.9141, -1.7734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:56:40,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.86 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.3438, -2.5469, 0.7031, 1.2266, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.0938, -1.1641, 1.7812, 0.4453, -1.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7969, -1.7422, 1.9375, 0.0479, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.6094, -2.0781, 0.1738, 1.7656, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.9688, -2.3906, 0.0302, 1.0703, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3438, -1.5938, 1.1250, 0.9297, -1.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2500, -1.5312, 1.2031, 1.7734, -1.8359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:56:40,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 17:56:40,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.42 | bwd_microstep: 
198.36 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 197.21 | step_microstep: 1.72 [2025-11-06 17:56:40,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.31 | bwd: 199.31 | bwd_inner: 1.94 | bwd_allreduce: 197.24 | step: 1.79 12%|█▏ | 428/3507 [11:54<1:04:53, 1.26s/it] {'loss': 0.6023, 'learning_rate': 1.9560899417648214e-05, 'epoch': 0.12} 12%|█▏ | 428/3507 [11:54<1:04:53, 1.26s/it]tensor([[-2.8281, -2.0312, 1.0234, 1.1484, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1875, -2.2500, 1.3359, 0.8672, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:56:40,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.01 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.0781, -2.4062, 0.3438, 1.0312, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8906, -1.8984, 1.6328, 0.3965, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1562, -2.7344, -0.8438, 1.0938, -2.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7188, -0.9844, 1.7188, 1.6484, -1.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.4688, -3.5781, 0.1206, 0.3594, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1875, -2.3750, 0.7266, 0.6953, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:56:42,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.30 | optimizer_step: 0.34 [2025-11-06 17:56:42,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.48 | bwd_microstep: 1784.50 | bwd_inner_microstep: 1.34 | 
bwd_allreduce_microstep: 1783.05 | step_microstep: 3.19 [2025-11-06 17:56:42,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.52 | bwd: 1785.43 | bwd_inner: 2.17 | bwd_allreduce: 1783.11 | step: 3.28 12%|█▏ | 429/3507 [11:56<1:19:56, 1.56s/it] {'loss': 0.5996, 'learning_rate': 1.9558188143283425e-05, 'epoch': 0.12} 12%|█▏ | 429/3507 [11:56<1:19:56, 1.56s/it]tensor([[-2.4844, -1.5078, 1.6797, 0.9375, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1719, -2.6719, -0.4688, 0.8906, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7812, -3.1562, -0.4258, 0.7461, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3438, -2.6875, 0.1152, 0.7383, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:56:43,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.08 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.0938, -1.6016, 0.4121, 1.7422, -1.6641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8438, -2.9062, 0.7578, 0.1895, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4062, -1.7656, 0.8867, 2.0625, -1.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7656e+00, -2.2500e+00, 2.2583e-03, 1.6172e+00, -2.2812e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:56:43,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 17:56:43,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.31 | bwd_microstep: 176.68 | bwd_inner_microstep: 1.06 | 
bwd_allreduce_microstep: 175.53 | step_microstep: 1.64 [2025-11-06 17:56:43,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.41 | bwd: 177.66 | bwd_inner: 1.96 | bwd_allreduce: 175.57 | step: 1.73 12%|█▏ | 430/3507 [11:57<1:04:07, 1.25s/it] {'loss': 0.5717, 'learning_rate': 1.955546871320393e-05, 'epoch': 0.12} 12%|█▏ | 430/3507 [11:57<1:04:07, 1.25s/it]tensor([[-3.5625, -2.7656, 0.4746, 1.1875, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7344, -3.1406, -0.5039, 0.8789, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2969, -2.5000, 0.6289, 0.5469, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.1953, -0.5391, 1.8828, 2.5781, -0.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:56:43,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.62 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.8125, -2.3281, -0.1465, 1.6016, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6406, -2.1406, -0.0099, 1.3047, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5312, -3.6406, -0.2246, -0.2451, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.4531, -1.7891, 0.8438, 1.4609, -2.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:56:45,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.18 | optimizer_step: 0.23 [2025-11-06 17:56:45,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.57 | bwd_microstep: 910.33 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 909.16 | 
step_microstep: 2.43 [2025-11-06 17:56:45,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.21 | bwd: 911.32 | bwd_inner: 1.98 | bwd_allreduce: 909.21 | step: 2.52 12%|█▏ | 431/3507 [11:58<1:10:03, 1.37s/it] {'loss': 0.5343, 'learning_rate': 1.9552741129730132e-05, 'epoch': 0.12} 12%|█▏ | 431/3507 [11:58<1:10:03, 1.37s/it]tensor([[-1.9062, -1.3750, 0.7383, 2.0156, -1.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:56:45,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.14 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5469, -2.7656, 0.3164, 0.6680, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4375, -2.5156, 0.9102, 0.5039, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5781, -2.1406, -0.2275, 1.7266, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4688, -4.4375, -0.1865, -0.3965, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1719, -2.4375, 0.3555, 0.6016, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3438, -3.2969, 0.5625, -0.2988, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6562, -2.6250, 1.1328, -0.3730, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:56:45,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 17:56:45,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 267.18 | bwd_microstep: 110.18 | bwd_inner_microstep: 1.84 | bwd_allreduce_microstep: 108.23 | step_microstep: 1.62 [2025-11-06 
17:56:45,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 445.37 | bwd: 111.18 | bwd_inner: 2.76 | bwd_allreduce: 108.27 | step: 1.70 12%|█▏ | 432/3507 [11:59<58:12, 1.14s/it] {'loss': 0.512, 'learning_rate': 1.9550005395189393e-05, 'epoch': 0.12} 12%|█▏ | 432/3507 [11:59<58:12, 1.14s/it]tensor([[-2.0781, -1.3594, 1.0703, 0.7383, -1.7109]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.3594, -0.7461, 1.5156, 2.2188, -1.0391]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.3359, -0.6875, 1.7578, 2.3438, -1.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3125, -2.4844, 0.8164, 0.8828, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9688, -2.1562, 0.9258, 0.9648, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1250, -2.0781, 1.5625, -0.0757, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0625, -3.1562, 0.3594, -0.0618, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:56:47,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.32 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.6094e+00, -1.5391e+00, 1.9453e+00, 6.2561e-04, -2.2031e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:56:48,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.34 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 17:56:48,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 294.17 | bwd_microstep: 1.95 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.83 | step_microstep: 3.97 [2025-11-06 17:56:48,262] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 491.48 | bwd: 3.04 | bwd_inner: 2.05 | bwd_allreduce: 0.86 | step: 4.05 12%|█▏ | 433/3507 [12:02<1:20:23, 1.57s/it] {'loss': 0.5133, 'learning_rate': 1.9547261511916042e-05, 'epoch': 0.12} 12%|█▏ | 433/3507 [12:02<1:20:23, 1.57s/it]tensor([[-2.5625, -2.1562, -0.4121, 1.2188, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8438, -2.8594, 0.8125, -0.0236, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7188, -5.0000, -1.7109, -0.2305, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4219, -1.5938, 1.4062, 0.9570, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:56:48,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.47 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.9531, -1.9453, 1.5078, 0.5469, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3281, -2.3281, 1.3203, 0.2432, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2500, -4.6250, -1.7188, -0.5703, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5625, -2.6406, 1.0234, 0.5391, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:56:48,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:56:48,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.20 | bwd_microstep: 2.52 | bwd_inner_microstep: 1.62 | bwd_allreduce_microstep: 0.82 | step_microstep: 1.71 [2025-11-06 17:56:48,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd: 500.69 | bwd: 3.33 | bwd_inner: 2.35 | bwd_allreduce: 0.86 | step: 1.79 12%|█▏ | 434/3507 [12:02<1:04:40, 1.26s/it] {'loss': 0.3679, 'learning_rate': 1.9544509482251344e-05, 'epoch': 0.12} 12%|█▏ | 434/3507 [12:02<1:04:40, 1.26s/it]tensor([[-2.9844, -2.5000, -0.3750, 1.2969, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:56:48,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.67 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-2.6562, -1.9297, 0.8203, 1.2031, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1250, -2.1250, 1.2734, 0.3066, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.5898, 0.1270, 2.5156, 2.5625, -0.3457]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2656, -1.7344, 0.4121, 1.7188, -1.8203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6719, -1.7266, 1.5859, 0.1787, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.8047, -0.7852, 2.1406, 0.2871, -1.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.9219, -3.0469, 0.5234, 0.5859, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:56:51,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.36 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 17:56:51,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.03 | bwd_microstep: 1.77 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.81 | step_microstep: 4.04 [2025-11-06 17:56:51,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.72 | bwd: 2.73 | bwd_inner: 1.77 | 
bwd_allreduce: 0.84 | step: 4.11
[interleaved per-rank logit/label tensor prints (cuda:0-3; grad_fn reprs stripped by the console capture) and per-microstep timing/optimizer lines condensed; one progress line and one aggregate [Rank 0] timing line kept per step]
 12%|█▏        | 435/3507 [12:05<1:24:48, 1.66s/it] {'loss': 0.848, 'learning_rate': 1.9541749308543535e-05, 'epoch': 0.12}
[2025-11-06 17:56:51,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.63 | bwd: 35.47 | bwd_inner: 1.64 | bwd_allreduce: 33.66 | step: 6.06
 12%|█▏        | 436/3507 [12:05<1:05:50, 1.29s/it] {'loss': 0.9966, 'learning_rate': 1.9538980993147773e-05, 'epoch': 0.12}
[2025-11-06 17:56:54,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 423.67 | bwd: 4.52 | bwd_inner: 2.68 | bwd_allreduce: 1.62 | step: 2.81
 12%|█▏        | 437/3507 [12:08<1:32:04, 1.80s/it] {'loss': 1.0093, 'learning_rate': 1.9536204538426185e-05, 'epoch': 0.12}
[2025-11-06 17:56:55,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.78 | bwd: 92.23 | bwd_inner: 1.84 | bwd_allreduce: 90.27 | step: 2.42
 12%|█▏        | 438/3507 [12:09<1:11:34, 1.40s/it] {'loss': 0.4928, 'learning_rate': 1.953341994674784e-05, 'epoch': 0.12}
[2025-11-06 17:56:57,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.23 | bwd: 3.18 | bwd_inner: 2.07 | bwd_allreduce: 0.98 | step: 4.28
 13%|█▎        | 439/3507 [12:11<1:26:25, 1.69s/it] {'loss': 0.5183, 'learning_rate': 1.9530627220488744e-05, 'epoch': 0.13}
[2025-11-06 17:56:58,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 505.11 | bwd: 2.66 | bwd_inner: 1.78 | bwd_allreduce: 0.77 | step: 2.83
 13%|█▎        | 440/3507 [12:12<1:08:56, 1.35s/it] {'loss': 1.0, 'learning_rate': 1.9527826362031847e-05, 'epoch': 0.13}
[2025-11-06 17:57:00,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.62 | bwd: 3.15 | bwd_inner: 2.18 | bwd_allreduce: 0.86 | step: 3.92
 13%|█▎        | 441/3507 [12:14<1:19:59, 1.57s/it] {'loss': 0.3981, 'learning_rate': 1.952501737376703e-05, 'epoch': 0.13}
[2025-11-06 17:57:00,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.17 | bwd: 165.04 | bwd_inner: 1.84 | bwd_allreduce: 163.07 | step: 4.17
 13%|█▎        | 442/3507 [12:14<1:05:00, 1.27s/it] {'loss': 0.9813, 'learning_rate': 1.952220025809113e-05, 'epoch': 0.13}
[2025-11-06 17:57:01,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.88 | bwd: 52.40 | bwd_inner: 1.76 | bwd_allreduce: 50.50 | step: 1.55
 13%|█▎        | 443/3507 [12:15<52:27, 1.03s/it] {'loss': 0.4576, 'learning_rate': 1.9519375017407896e-05, 'epoch': 0.13}
[2025-11-06 17:57:03,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.90 | bwd: 1617.33 | bwd_inner: 2.13 | bwd_allreduce: 1614.99 | step: 1.72
 13%|█▎        | 444/3507 [12:17<1:08:13, 1.34s/it] {'loss': 0.7488, 'learning_rate': 1.951654165412803e-05, 'epoch': 0.13}
[2025-11-06 17:57:03,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.29 | bwd: 152.79 | bwd_inner: 1.78 | bwd_allreduce: 150.87 | step: 1.76
 13%|█▎        | 445/3507 [12:17<55:09, 1.08s/it] {'loss': 0.4868, 'learning_rate': 1.9513700170669152e-05, 'epoch': 0.13}
[2025-11-06 17:57:06,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.72 | bwd: 1772.79 | bwd_inner: 1.76 | bwd_allreduce: 1770.91 | step: 2.96
 13%|█▎        | 446/3507 [12:20<1:16:41, 1.50s/it] {'loss': 1.4357, 'learning_rate': 1.9510850569455815e-05, 'epoch': 0.13}
[2025-11-06 17:57:06,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.17 | bwd: 104.09 | bwd_inner: 2.37 | bwd_allreduce: 101.60 | step: 2.38
 13%|█▎        | 447/3507 [12:20<1:00:19, 1.18s/it] {'loss': 0.8027, 'learning_rate': 1.9507992852919496e-05, 'epoch': 0.13}
[2025-11-06 17:57:07,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.90 | bwd: 198.99 | bwd_inner: 2.17 | bwd_allreduce: 196.66 | step: 1.98
 13%|█▎        | 448/3507 [12:21<59:27, 1.17s/it] {'loss': 0.7433, 'learning_rate': 1.9505127023498603e-05, 'epoch': 0.13}
[2025-11-06 17:57:08,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 409.74 | bwd: 76.10 | bwd_inner: 1.98 | bwd_allreduce: 73.99 | step: 2.66
 13%|█▎        | 449/3507 [12:22<49:38, 1.03it/s] {'loss': 0.567, 'learning_rate': 1.950225308363846e-05, 'epoch': 0.13}
[2025-11-06 17:57:10,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.09 | bwd: 3.57 | bwd_inner: 2.19 | bwd_allreduce: 1.20 | step: 1.89
 13%|█▎        | 450/3507 [12:24<1:02:06, 1.22s/it] {'loss': 0.5955, 'learning_rate': 1.949937103579131e-05, 'epoch': 0.13}
[2025-11-06 17:57:11,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 423.20 | bwd: 716.04 | bwd_inner: 1.91 | bwd_allreduce: 714.00 | step: 1.66
 13%|█▎        | 451/3507 [12:25<1:01:29, 1.21s/it] {'loss': 0.5687, 'learning_rate': 1.9496480882416316e-05, 'epoch': 0.13}
[2025-11-06 17:57:14,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.34 | bwd: 1628.81 | bwd_inner: 2.24 | bwd_allreduce: 1626.45 | step: 1.85
 13%|█▎        | 452/3507 [12:28<1:26:53, 1.71s/it] {'loss': 0.8184, 'learning_rate': 1.949358262597957e-05, 'epoch': 0.13}
[2025-11-06 17:57:14,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.38 | bwd: 71.97 | bwd_inner: 1.92 | bwd_allreduce: 69.93 | step: 1.69
 13%|█▎        | 453/3507 [12:28<1:08:25, 1.34s/it] {'loss': 0.929, 'learning_rate': 1.9490676268954063e-05, 'epoch': 0.13}
[2025-11-06 17:57:16,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 270.42 | bwd: 1570.57 | bwd_inner: 1.92 | bwd_allreduce: 1568.51 | step: 1.81
 13%|█▎        | 454/3507 [12:30<1:16:25, 1.50s/it] {'loss': 0.6383, 'learning_rate': 1.9487761813819698e-05, 'epoch': 0.13}
[2025-11-06 17:57:17,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 274.98 | bwd: 540.77 | bwd_inner: 1.78 | bwd_allreduce: 538.86 | step: 1.89
 13%|█▎        | 455/3507 [12:31<1:06:24, 1.31s/it] {'loss': 0.3152, 'learning_rate': 1.9484839263063294e-05, 'epoch': 0.13}
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9219, -3.1094, -0.0264, 0.0952, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5469, -2.6406, 0.6484, 0.6367, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.6406, -0.4824, 2.8594, 0.4590, -1.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7344, -2.1250, 0.3125, 1.6250, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:57:19,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.26 | optimizer_step: 0.33 [2025-11-06 17:57:19,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.94 | bwd_microstep: 1837.84 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1836.76 | step_microstep: 2.54 [2025-11-06 17:57:19,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.67 | bwd: 1838.69 | bwd_inner: 1.69 | bwd_allreduce: 1836.83 | step: 2.65 13%|█▎ | 456/3507 [12:33<1:20:28, 1.58s/it] {'loss': 0.5187, 'learning_rate': 1.9481908619178576e-05, 'epoch': 0.13} 13%|█▎ | 456/3507 [12:33<1:20:28, 1.58s/it]tensor([[-3.0156, -1.8281, 1.7891, -0.3965, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9531, -2.4062, -0.1602, 1.4688, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6250, -2.8906, 0.0256, 1.1562, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6094, -1.7109, 1.3125, 0.7344, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:57:19,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.47 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | 
bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.0938, -3.3125, -0.0625, 1.1094, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5000, -1.3125, 1.8828, -0.1777, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-2.9375, -2.0625, 0.9648, 0.9648, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.8750, -2.3750, -0.2402, 1.6016, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:57:20,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.19 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 17:57:20,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 206.62 | bwd_microstep: 16.03 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 14.93 | step_microstep: 3.53 [2025-11-06 17:57:20,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 394.10 | bwd: 16.93 | bwd_inner: 1.83 | bwd_allreduce: 14.97 | step: 3.62 13%|█▎ | 457/3507 [12:33<1:03:12, 1.24s/it] {'loss': 0.7236, 'learning_rate': 1.9478969884666173e-05, 'epoch': 0.13} 13%|█▎ | 457/3507 [12:33<1:03:12, 1.24s/it]tensor([[-5.0312, -4.2500, -1.1328, -0.1406, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9062, -3.1719, -0.2402, 0.6289, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5156, -1.8906, 0.5156, 1.5000, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0312e+00, -3.2344e+00, -3.8147e-05, 6.7969e-01, -3.3750e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:57:20,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.49 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 
0.03 | step_microstep: 0.09 tensor([[-3.3125, -2.1719, 1.5938, -0.0486, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -3.3594, -0.3906, 0.8281, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9375, -1.7344, 2.0781, 0.3516, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.7812, -2.1562, 0.4277, 1.7422, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:57:21,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.17 | optimizer_step: 0.21 [2025-11-06 17:57:21,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.60 | bwd_microstep: 1230.52 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 1229.38 | step_microstep: 1.94 [2025-11-06 17:57:21,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 444.12 | bwd: 1231.42 | bwd_inner: 1.87 | bwd_allreduce: 1229.43 | step: 2.03 13%|█▎ | 458/3507 [12:35<1:10:28, 1.39s/it] {'loss': 0.3152, 'learning_rate': 1.9476023062033617e-05, 'epoch': 0.13} 13%|█▎ | 458/3507 [12:35<1:10:28, 1.39s/it]tensor([[-2.5000, -1.3125, 2.2812, 0.3613, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0000, -4.0938, -0.6172, -0.0559, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:57:22,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.11 | bwd_microstep: 1.20 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.8281, -3.0312, 0.1104, 0.9375, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0938e+00, -1.6562e+00, 1.1139e-03, 1.3516e+00, -1.5859e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:1') tensor([[-2.4844, -1.7969, 0.7500, 1.5312, -1.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5625, -1.6016, 1.3984, 0.8750, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1250, -1.9297, 1.4922, -0.5273, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-3.6094, -2.9688, -0.4316, 0.8281, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:57:22,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:57:22,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.64 | bwd_microstep: 409.28 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 408.23 | step_microstep: 1.64 [2025-11-06 17:57:22,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.77 | bwd: 410.48 | bwd_inner: 2.09 | bwd_allreduce: 408.26 | step: 1.72 13%|█▎ | 459/3507 [12:36<1:01:48, 1.22s/it] {'loss': 0.9137, 'learning_rate': 1.9473068153795353e-05, 'epoch': 0.13} 13%|█▎ | 459/3507 [12:36<1:01:48, 1.22s/it]tensor([[-2.1875, -1.7734, -0.0525, 1.7500, -1.6797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:57:22,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.49 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.3281, -2.3125, 1.2891, 0.6523, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8594, -2.3125, -0.0286, 1.4766, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7969, -2.2656, -0.0459, 1.4922, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2969, 
-2.6250, -0.0698, 1.1016, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.0000, -2.2031, 0.6445, 1.4922, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2812, -3.4844, -0.2490, 1.0547, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.4062, -1.3672, 1.7422, 0.8125, -1.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:57:23,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:57:23,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.49 | bwd_microstep: 812.82 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 811.61 | step_microstep: 1.75 [2025-11-06 17:57:23,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.99 | bwd: 813.75 | bwd_inner: 1.99 | bwd_allreduce: 811.64 | step: 1.82 13%|█▎ | 460/3507 [12:37<1:00:56, 1.20s/it] {'loss': 0.3164, 'learning_rate': 1.9470105162472705e-05, 'epoch': 0.13} 13%|█▎ | 460/3507 [12:37<1:00:56, 1.20s/it]tensor([[-2.6406, -1.9375, 0.7500, 1.9531, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5312, -2.9375, -0.5234, 0.9766, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0625, -3.2969, -0.2871, 0.6211, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:57:24,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.65 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.7656, -1.8672, 0.9922, 0.4434, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5000, -3.5625, -0.0055, 0.4004, -3.7656]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4062, -2.2656, 1.4531, -0.2041, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-3.5156, -2.7500, 0.1318, 0.7539, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -3.0781, 0.6133, 0.3848, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:57:25,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.40 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 17:57:25,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.86 | bwd_microstep: 1398.16 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 1396.84 | step_microstep: 3.22 [2025-11-06 17:57:25,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.54 | bwd: 1399.01 | bwd_inner: 2.00 | bwd_allreduce: 1396.88 | step: 3.30 13%|█▎ | 461/3507 [12:39<1:10:17, 1.38s/it] {'loss': 1.0916, 'learning_rate': 1.946713409059391e-05, 'epoch': 0.13} 13%|█▎ | 461/3507 [12:39<1:10:17, 1.38s/it]tensor([[-3.2031, -2.1406, 1.4531, 0.6562, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2188, -1.7656, 0.1377, 1.7891, -1.7266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7969, -2.7812, 0.7812, 0.1875, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:57:25,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.05 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.5312, -1.8203, 0.8906, 1.8125, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1562, -2.4219, 0.2275, 1.1953, -2.5469]], device='cuda:2', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8438, -2.9219, 0.3301, 0.3125, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.1250, -2.3594, 0.3555, 1.0078, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2812, -3.3438, 0.1504, 0.5234, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:57:26,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:57:26,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.59 | bwd_microstep: 13.76 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 12.76 | step_microstep: 1.53 [2025-11-06 17:57:26,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.66 | bwd: 14.64 | bwd_inner: 1.71 | bwd_allreduce: 12.80 | step: 1.62 13%|█▎ | 462/3507 [12:39<55:30, 1.09s/it] {'loss': 0.7479, 'learning_rate': 1.9464154940694086e-05, 'epoch': 0.13} 13%|█▎ | 462/3507 [12:39<55:30, 1.09s/it]tensor([[-3.0625, -2.4531, -0.1797, 1.1406, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2188, -2.1406, 1.3594, 0.2090, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5625, -5.6562, -1.9375, -0.8281, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:57:26,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.39 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.5234, -0.4570, 2.5156, 0.8555, -1.1797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4375, -2.3750, 1.2109, -0.0781, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') 
tensor([[-3.8438, -2.8750, 0.3457, -0.0608, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2656, -2.5312, 0.1836, 0.9922, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7969, -2.9062, 0.2305, 0.3984, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:57:27,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 17:57:27,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 103.00 | bwd_microstep: 1085.40 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1084.20 | step_microstep: 1.97 [2025-11-06 17:57:27,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 281.41 | bwd: 1086.25 | bwd_inner: 1.88 | bwd_allreduce: 1084.24 | step: 2.04 13%|█▎ | 463/3507 [12:41<1:07:16, 1.33s/it] {'loss': 0.3843, 'learning_rate': 1.9461167715315264e-05, 'epoch': 0.13} 13%|█▎ | 463/3507 [12:41<1:07:16, 1.33s/it]tensor([[-2.5938, -1.3828, 1.8359, -0.2256, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.8906, -2.8750, 0.5898, 0.3105, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2188, -2.3125, 0.7148, 0.3906, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.3906, -1.1797, 2.1875, 0.1670, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:57:28,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.60 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2812, -3.2656, 0.5391, 0.6641, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5469, -2.2500, 1.6406, 
-0.5352, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.7188, -1.1250, 1.0938, 2.2656, -1.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0469, -2.4844, -0.2676, 1.2812, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:57:28,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:57:28,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.29 | bwd_microstep: 2.02 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.87 | step_microstep: 1.60 [2025-11-06 17:57:28,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 426.90 | bwd: 3.09 | bwd_inner: 2.06 | bwd_allreduce: 0.90 | step: 1.68 13%|█▎ | 464/3507 [12:42<54:12, 1.07s/it] {'loss': 0.8092, 'learning_rate': 1.9458172417006347e-05, 'epoch': 0.13} 13%|█▎ | 464/3507 [12:42<54:12, 1.07s/it]tensor([[-1.9375, -1.3359, 0.8281, 2.2812, -1.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3906, -1.7578, 0.7383, 1.9297, -1.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7188, -1.6250, 1.7734, 0.0630, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:57:28,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.19 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.8594, -3.1406, -0.2852, 1.0234, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7812, -2.7656, 0.7500, 0.3398, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6875, -3.0156, -0.4531, 0.6094, -3.0312]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3906, -1.2578, 1.8906, 0.3145, -1.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5625, -3.5000, 0.3867, -0.0240, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:57:30,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.15 | optimizer_step: 0.20 [2025-11-06 17:57:30,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 200.01 | bwd_microstep: 498.28 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 497.02 | step_microstep: 2.08 [2025-11-06 17:57:30,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 487.22 | bwd: 499.18 | bwd_inner: 1.97 | bwd_allreduce: 497.06 | step: 2.15 13%|█▎ | 465/3507 [12:44<1:05:39, 1.30s/it] {'loss': 0.4742, 'learning_rate': 1.9455169048323136e-05, 'epoch': 0.13} 13%|█▎ | 465/3507 [12:44<1:05:39, 1.30s/it]tensor([[-2.4375, -1.8906, 0.3008, 1.7969, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)tensor([[-4.3125, -3.7656, -1.3594, 0.8516, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([3], device='cuda:1') [2025-11-06 17:57:30,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.02 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.9844, -1.9844, 1.0469, 0.5820, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9844, -2.7031, 1.3672, -0.6680, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.6406, -1.6641, 1.2812, 0.0737, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2812, -3.0781, 0.9102, -0.3594, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-2.9219, -1.7734, 1.8516, 0.0183, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9844, -2.9688, 0.4375, -0.1787, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:57:30,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.26 [2025-11-06 17:57:30,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.02 | bwd_microstep: 182.13 | bwd_inner_microstep: 1.41 | bwd_allreduce_microstep: 180.64 | step_microstep: 1.88 [2025-11-06 17:57:30,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.06 | bwd: 183.17 | bwd_inner: 2.36 | bwd_allreduce: 180.68 | step: 1.97 13%|█▎ | 466/3507 [12:44<54:01, 1.07s/it] {'loss': 0.2946, 'learning_rate': 1.9452157611828312e-05, 'epoch': 0.13} 13%|█▎ | 466/3507 [12:44<54:01, 1.07s/it]tensor([[-2.8125, -1.9297, 1.1094, 0.9453, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:57:30,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.21 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10 tensor([[-5.3438, -4.2812, -0.3184, -0.6445, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0312, -3.7188, 0.0613, -1.4062, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -3.9219, 0.0549, 0.0928, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[2.9688, 3.6250, 5.2812, 4.7812, 2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.8906, -2.1094, 0.6641, 1.2344, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4375, -3.2969, 
0.2715, -0.6680, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5781, -1.6484, 1.2344, 0.9648, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:57:33,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 17:57:33,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.84 | bwd_microstep: 1790.47 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 1789.43 | step_microstep: 1.86 [2025-11-06 17:57:33,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 298.07 | bwd: 1791.55 | bwd_inner: 1.94 | bwd_allreduce: 1789.48 | step: 1.96 13%|█▎ | 467/3507 [12:47<1:19:24, 1.57s/it] {'loss': 0.6783, 'learning_rate': 1.9449138110091444e-05, 'epoch': 0.13} 13%|█▎ | 467/3507 [12:47<1:19:24, 1.57s/it]tensor([[-3.7188, -2.7500, 0.4805, -0.0598, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0625, -1.9219, 1.6484, 0.1309, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:57:33,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.06 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-1.3906, -0.6406, 1.8281, 2.1562, -1.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7812, -2.0156, 0.7812, 1.8516, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9375, -2.3281, 0.1196, 1.8438, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.8359, -1.3203, 0.6680, 2.1719, -1.3828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6094, -2.0781, 0.0630, 1.7500, -2.0625]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.5781, -0.9219, 1.3047, 1.8438, -1.1953]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:57:33,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.25 | optimizer_step: 0.27 [2025-11-06 17:57:33,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.97 | bwd_microstep: 57.49 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 56.29 | step_microstep: 2.27 [2025-11-06 17:57:33,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.04 | bwd: 58.51 | bwd_inner: 1.98 | bwd_allreduce: 56.35 | step: 2.39 13%|█▎ | 468/3507 [12:47<1:01:52, 1.22s/it] {'loss': 0.3647, 'learning_rate': 1.9446110545688983e-05, 'epoch': 0.13} 13%|█▎ | 468/3507 [12:47<1:01:52, 1.22s/it]tensor([[-2.8125, -2.0156, 0.8633, 1.5859, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9688, -3.0469, 0.2969, 0.6172, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1094, -1.8281, 1.8828, -0.3242, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0156, -2.2812, 0.3633, 1.5312, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:57:34,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.03 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-2.3906, -1.6562, 0.8711, 1.6641, -1.8984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7812, -1.7891, 1.4141, 0.7109, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.3125, -2.7969, -0.8047, 0.4512, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
[Training log excerpt, 2025-11-06 17:57:36 – 17:58:03. The capture interleaved, for every step, per-rank debug prints of bfloat16 logit tensors and integer label tensors (devices cuda:0–cuda:3; the `grad_fn=<…>` class names were truncated by the capture) with DeepSpeed Rank-0 timing lines (optimizer_allgather / optimizer_gradients / optimizer_step and fwd / bwd / bwd_inner / bwd_allreduce / step microstep times in ms). That repeated debug and timing output is omitted here; the per-step progress lines are retained below.]

13%|█▎ | 469/3507 [12:50<1:21:44, 1.61s/it] {'loss': 0.8917, 'learning_rate': 1.944307492120426e-05, 'epoch': 0.13}
13%|█▎ | 470/3507 [12:50<1:04:00, 1.26s/it] {'loss': 0.4599, 'learning_rate': 1.9440031239227476e-05, 'epoch': 0.13}
13%|█▎ | 471/3507 [12:52<1:09:30, 1.37s/it] {'loss': 0.5378, 'learning_rate': 1.9436979502355725e-05, 'epoch': 0.13}
13%|█▎ | 472/3507 [12:53<1:05:01, 1.29s/it] {'loss': 0.6951, 'learning_rate': 1.9433919713192952e-05, 'epoch': 0.13}
13%|█▎ | 473/3507 [12:55<1:12:26, 1.43s/it] {'loss': 0.4847, 'learning_rate': 1.9430851874349983e-05, 'epoch': 0.13}
14%|█▎ | 474/3507 [12:55<58:44, 1.16s/it] {'loss': 0.3315, 'learning_rate': 1.942777598844452e-05, 'epoch': 0.14}
14%|█▎ | 475/3507 [12:56<56:48, 1.12s/it] {'loss': 0.9651, 'learning_rate': 1.9424692058101123e-05, 'epoch': 0.14}
14%|█▎ | 476/3507 [12:58<1:07:44, 1.34s/it] {'loss': 0.6113, 'learning_rate': 1.942160008595121e-05, 'epoch': 0.14}
14%|█▎ | 477/3507 [12:59<1:05:17, 1.29s/it] {'loss': 0.4823, 'learning_rate': 1.941850007463307e-05, 'epoch': 0.14}
14%|█▎ | 478/3507 [13:02<1:21:10, 1.61s/it] {'loss': 0.2762, 'learning_rate': 1.9415392026791857e-05, 'epoch': 0.14}
14%|█▎ | 479/3507 [13:02<1:07:33, 1.34s/it] {'loss': 0.8073, 'learning_rate': 1.9412275945079568e-05, 'epoch': 0.14}
14%|█▎ | 480/3507 [13:03<1:01:01, 1.21s/it] {'loss': 0.785, 'learning_rate': 1.940915183215506e-05, 'epoch': 0.14}
14%|█▎ | 481/3507 [13:06<1:18:45, 1.56s/it] {'loss': 0.4011, 'learning_rate': 1.9406019690684054e-05, 'epoch': 0.14}
14%|█▎ | 482/3507 [13:07<1:17:23, 1.53s/it] {'loss': 0.3877, 'learning_rate': 1.9402879523339103e-05, 'epoch': 0.14}
14%|█▍ | 483/3507 [13:08<1:00:22, 1.20s/it] {'loss': 0.6061, 'learning_rate': 1.939973133279962e-05, 'epoch': 0.14}
14%|█▍ | 484/3507 [13:09<58:52, 1.17s/it] {'loss': 0.3854, 'learning_rate': 1.9396575121751863e-05, 'epoch': 0.14}
14%|█▍ | 485/3507 [13:10<1:07:11, 1.33s/it] {'loss': 0.6765, 'learning_rate': 1.939341089288893e-05, 'epoch': 0.14}
14%|█▍ | 486/3507 [13:11<1:00:34, 1.20s/it] {'loss': 0.3555, 'learning_rate': 1.9390238648910765e-05, 'epoch': 0.14}
14%|█▍ | 487/3507 [13:13<1:08:19, 1.36s/it] {'loss': 0.4027, 'learning_rate': 1.9387058392524146e-05, 'epoch': 0.14}
14%|█▍ | 488/3507 [13:15<1:16:01, 1.51s/it] {'loss': 0.6052, 'learning_rate': 1.9383870126442694e-05, 'epoch': 0.14}
14%|█▍ | 489/3507 [13:16<1:12:30, 1.44s/it] {'loss': 0.8309, 'learning_rate': 1.9380673853386855e-05, 'epoch': 0.14}
14%|█▍ | 490/3507 [13:17<1:04:51, 1.29s/it] {'loss': 0.5615, 'learning_rate': 1.9377469576083917e-05, 'epoch': 0.14}
-0.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9062, -2.6562, 1.3828, -0.0659, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8438, -1.7344, 1.6953, 0.8242, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8125, -1.6953, 1.7109, 0.5742, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:58:04,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 82.71 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.1094, -2.4062, 0.2617, 1.7031, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.6250, -1.7266, 1.0781, 0.7734, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4375, -2.1719, 1.9062, 0.2012, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, -4.0312, -1.5547, 0.3359, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:58:05,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.15 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 17:58:05,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.76 | bwd_microstep: 1462.32 | bwd_inner_microstep: 2.56 | bwd_allreduce_microstep: 1459.60 | step_microstep: 3.04 [2025-11-06 17:58:05,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 253.47 | bwd: 1463.14 | bwd_inner: 3.33 | bwd_allreduce: 1459.63 | step: 3.13 14%|█▍ | 491/3507 [13:19<1:19:04, 1.57s/it] {'loss': 0.4219, 'learning_rate': 1.937425729726799e-05, 'epoch': 0.14} 14%|█▍ | 491/3507 [13:19<1:19:04, 1.57s/it]tensor([[-2.8594, -1.9531, 0.8320, 0.5703, -2.3125]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7188, -2.8906, 0.1494, 0.6406, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6562, -2.4219, 1.4688, -0.2559, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5938, -3.5156, 0.4219, 0.2754, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:58:06,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.92 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.0625, -2.9844, 0.4219, -0.0500, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9531, -3.1250, 0.0840, 1.2109, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.0781, -1.2188, 1.1719, 0.7578, -1.6484]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.2656, -2.2188, 1.2578, 1.0625, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:58:06,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.04 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:58:06,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.73 | bwd_microstep: 2.16 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 0.81 | step_microstep: 2.72
[2025-11-06 17:58:06,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.68 | bwd: 3.20 | bwd_inner: 2.22 | bwd_allreduce: 0.85 | step: 2.81
14%|█▍ | 492/3507 [13:20<1:04:14, 1.28s/it] {'loss': 0.8074, 'learning_rate': 1.9371037019680017e-05, 'epoch': 0.14}
tensor([[-3.0000, -2.4219, -0.1875, 1.3672, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.0625, -1.5391, 0.4941, 2.2656, -1.5391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.2969, -1.7188, 0.3828, 1.6719, -1.7578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.7344, -0.4844, 3.1719, 1.2500, -1.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:58:07,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 303.90 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.5312, -3.9219, -1.1953, 0.8750, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7812, -3.0312, -0.0923, 1.5391, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.1406, -1.9453, 1.9062, 0.7070, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.9375, -2.1875, 0.7383, 1.8281, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:58:10,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.21 | optimizer_step: 0.32
[2025-11-06 17:58:10,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.39 | bwd_microstep: 3284.06 | bwd_inner_microstep: 2.30 | bwd_allreduce_microstep: 3281.62 | step_microstep: 2.41
[2025-11-06 17:58:10,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 428.34 | bwd: 3285.06 | bwd_inner: 3.22 | bwd_allreduce: 3281.67 | step: 2.50
14%|█▍ | 493/3507 [13:24<1:44:15, 2.08s/it] {'loss': 0.4386, 'learning_rate': 1.9367808746067768e-05, 'epoch': 0.14}
tensor([[-2.0781, -1.3906, 1.1094, 2.3125, -1.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6250, -2.4844, 1.3438, 0.7266, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.7969, -1.4922, 2.2812, 0.2734, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5312, -3.3906, 0.6367, 0.2598, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:58:10,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.66 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.3906, -2.3594, 1.3438, 1.1797, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.1719, -1.4609, 0.9531, 1.6953, -1.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-4.0625, -3.1719, 0.1768, 0.4688, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.1484, 0.0369, 3.0000, 0.6797, -0.8711]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 17:58:10,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:58:10,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.74 | bwd_microstep: 29.31 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 28.11 | step_microstep: 1.46
[2025-11-06 17:58:10,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.43 | bwd: 30.27 | bwd_inner: 1.99 | bwd_allreduce: 28.15 | step: 1.54
14%|█▍ | 494/3507 [13:24<1:20:01, 1.59s/it] {'loss': 0.8373, 'learning_rate': 1.9364572479185824e-05, 'epoch': 0.14}
tensor([[-1.8281, -1.1953, 1.1719, 2.5000, -1.3203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.1719, -1.1875, 1.5703, 0.3965, -1.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.3047, -0.8281, 0.9961, 2.7344, -0.8555]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:58:11,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.22 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-6.1250, -5.3438, -1.8906, -0.1992, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.7812, -1.7266, 1.6797, 1.1875, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5625, -4.5312, -0.5742, -0.1387, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7812, -3.1719, -0.6133, 1.2266, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.4688, -2.9375, -0.6719, 1.2969, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:58:13,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | optimizer_gradients: 0.20 | optimizer_step: 0.18
[2025-11-06 17:58:13,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.78 | bwd_microstep: 2137.19 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 2135.98 | step_microstep: 2.73
[2025-11-06 17:58:13,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.03 | bwd: 2138.23 | bwd_inner: 2.07 | bwd_allreduce: 2136.02 | step: 2.82
14%|█▍ | 495/3507 [13:27<1:34:04, 1.87s/it] {'loss': 0.418, 'learning_rate': 1.93613282217956e-05, 'epoch': 0.14}
tensor([[-2.0312, -1.2891, 1.2031, 1.6406, -1.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8906, -2.8906, 0.4922, 0.3496, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9688, -3.2812, -0.4375, 1.2578, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6094, -2.7656, 0.4062, 0.6445, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:58:13,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.69 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.2812, -3.4062, 0.0757, 1.0781, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.0000, -1.8984, 1.4062, 0.3047, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0625, -3.3594, -0.5547, 0.8359, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5781, -2.5625, 0.8984, 1.1484, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:58:13,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 17:58:13,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.79 | bwd_microstep: 1.71 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.81 | step_microstep: 2.17
[2025-11-06 17:58:13,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.51 | bwd: 2.52 | bwd_inner: 1.54 | bwd_allreduce: 0.85 | step: 2.25
14%|█▍ | 496/3507 [13:27<1:12:35, 1.45s/it] {'loss': 0.5322, 'learning_rate': 1.935807597666532e-05, 'epoch': 0.14}
tensor([[-5.3750, -4.5938, -1.2656, 0.2197, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4844, -2.8906, -0.3457, 1.5547, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06
17:58:14,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.50 | bwd_microstep: 2.36 | bwd_inner_microstep: 2.23 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-3.6250, -2.6562, 0.8047, 0.8867, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7812, -2.9375, 0.2812, 1.0000, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8438, -3.7969, -0.0928, -0.1328, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.9922, -0.8672, 2.3906, 0.5273, -1.6016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.7812, -2.2344, -0.1177, 1.4375, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5469, -3.0469, -0.9141, 1.0000, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
[2025-11-06 17:58:16,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.17 | optimizer_step: 0.15
[2025-11-06 17:58:16,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.01 | bwd_microstep: 2380.28 | bwd_inner_microstep: 2.46 | bwd_allreduce_microstep: 2377.71 | step_microstep: 1.82
[2025-11-06 17:58:16,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.54 | bwd: 2382.65 | bwd_inner: 4.71 | bwd_allreduce: 2377.77 | step: 1.92
14%|█▍ | 497/3507 [13:30<1:32:20, 1.84s/it] {'loss': 0.9189, 'learning_rate': 1.9354815746570033e-05, 'epoch': 0.14}
tensor([[-4.7500, -3.4062, 0.8477, -1.0781, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:58:16,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.51 | bwd_microstep: 1.11 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.7188, -3.9375, -0.6719, 0.2129, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0938, -3.0938, 0.4180, 0.1875, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.9375, -4.6875, -0.4297, -1.3828, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6719, -2.0312, 0.3672, 1.5391, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.5469, -1.5547, 1.3750, 0.3359, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-1.9922, -0.8555, 2.4062, 1.4375, -1.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8125, -3.9688, -0.5156, 0.9414, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 17:58:17,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 17:58:17,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.05 | bwd_microstep: 108.80 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 107.40 | step_microstep: 1.82
[2025-11-06 17:58:17,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.58 | bwd: 109.91 | bwd_inner: 2.35 | bwd_allreduce: 107.44 | step: 1.91
14%|█▍ | 498/3507 [13:30<1:11:28, 1.43s/it] {'loss': 0.9449, 'learning_rate': 1.935154753429159e-05, 'epoch': 0.14}
tensor([[-4.1875, -3.0156, 0.9492, 0.0884, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:58:17,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 71.11 | bwd_microstep: 1.29 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.3438, -1.6172, 1.0234, 2.0312, -1.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.0938, -5.0938, -1.0859, -0.8398, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2500, -4.3750, -0.9375, -0.4141, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.1250, -2.5156, -0.0466, 1.5078, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.3906, -2.1094, 1.9297, 0.1543, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5000, -3.5156, 0.0300, 0.1553, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0938, -4.0312, 0.0137, 0.0391, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:58:19,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.34 | optimizer_step: 0.49
[2025-11-06 17:58:19,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.61 | bwd_microstep: 2472.04 | bwd_inner_microstep: 1.41 | bwd_allreduce_microstep: 2470.51 | step_microstep: 3.58
[2025-11-06 17:58:19,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 255.64 | bwd: 2473.34 | bwd_inner: 2.61 | bwd_allreduce: 2470.58 | step: 3.65
14%|█▍ | 499/3507 [13:33<1:31:34, 1.83s/it] {'loss': 0.5176, 'learning_rate': 1.9348271342618657e-05, 'epoch': 0.14}
tensor([[-3.8281, -3.2344, -0.6484, 1.4141, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0000, -5.3125, -2.1250, -0.3613, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.8906, -2.2344, 0.2773, 1.6172, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:58:20,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.62 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.9375, -2.9375, 0.6250, 0.4941, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2500, -2.3438, 1.0078, 1.2812, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6875, -2.0469, 0.4258, 1.7578, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.4688, -3.4844, 0.2080, 0.5234, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.7969, -1.6875, 1.9531, 0.4922, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 17:58:20,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 17:58:20,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.62 | bwd_microstep: 139.65 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 138.56 | step_microstep: 1.35
[2025-11-06 17:58:20,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.27 | bwd: 140.66 | bwd_inner: 1.93 | bwd_allreduce: 138.60 | step: 1.44
14%|█▍ | 500/3507 [13:34<1:12:21, 1.44s/it] {'loss': 0.3682, 'learning_rate': 1.9344987174346712e-05, 'epoch': 0.14}
tensor([[-4.8750, -4.3750, -1.9531, 0.4883, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:58:20,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.47 | bwd_microstep: 1.23 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.3438, -2.6250, 0.1553, 1.4375, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0312, -3.0156, 0.6406, 0.4922, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.3828, -0.3398, 2.7500, 1.3281, -1.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.3125, -4.2188, -0.0369, 0.3203, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6562, -3.3438, 1.0312, -0.2695, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9531, -3.3594, -0.9297, 0.7656, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0000, -5.0000, -0.8867, 0.0255, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:58:22,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 17:58:22,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.44 | bwd_microstep: 1901.88 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 1900.55 | step_microstep: 1.76
[2025-11-06 17:58:22,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 271.93 | bwd: 1903.11 | bwd_inner: 2.39 | bwd_allreduce: 1900.59 | step: 1.84
14%|█▍ | 501/3507 [13:36<1:23:45, 1.67s/it] {'loss': 0.3528, 'learning_rate': 1.9341695032278038e-05, 'epoch': 0.14}
tensor([[-4.5312, -3.2344, 1.0859, -0.1338, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7969, -2.6094, 1.1953, -0.1001, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2656, -2.0938, 1.5703, 0.6016, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:58:22,897] [INFO]
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.44 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.1250, -3.0156, 0.5117, 0.2109, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.6094, -0.1289, 1.4375, 2.6875, -0.2598]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8125, -2.7969, 0.8398, 1.0156, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5469, -2.7031, 0.3184, 0.7539, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.8125, -1.9453, 1.0781, 1.5781, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 17:58:23,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 17:58:23,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.63 | bwd_microstep: 3.36 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 2.10 | step_microstep: 1.82
[2025-11-06 17:58:23,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.10 | bwd: 4.15 | bwd_inner: 1.91 | bwd_allreduce: 2.13 | step: 1.89
14%|█▍ | 502/3507 [13:36<1:05:05, 1.30s/it] {'loss': 0.4426, 'learning_rate': 1.933839491922172e-05, 'epoch': 0.14}
tensor([[-2.8125, -2.3438, -0.3340, 1.4766, -2.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.9062, -2.1250, 0.6953, 1.3359, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:58:23,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.55 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.0469, -1.9297, 1.7344, 0.6992, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9531, -2.8906, 0.9609, 0.9023, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5469, -2.9531, -0.5078, 1.1875, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.4688, -1.2812, 2.1562, 0.8398, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4844, -2.9531, -0.6484, 1.4922, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.9688, -2.2344, 0.5117, 1.6172, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:58:26,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.20 | optimizer_step: 0.19
[2025-11-06 17:58:26,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.36 | bwd_microstep: 2542.66 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 2541.38 | step_microstep: 2.40
[2025-11-06 17:58:26,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 401.94 | bwd: 2543.47 | bwd_inner: 1.92 | bwd_allreduce: 2541.42 | step: 2.47
14%|█▍ | 503/3507 [13:39<1:30:23, 1.81s/it] {'loss': 0.4111, 'learning_rate': 1.9335086837993648e-05, 'epoch': 0.14}
tensor([[-5.1562, -4.6250, -2.0625, 0.3926, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5000, -4.6562, -1.2812, -0.2676, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2344, -2.5000, 0.2910, 1.4453, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.8281, -1.5781, 2.0469, 0.2471, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:58:26,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.72 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-5.3125, -4.5312, -1.3125, 0.2500, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.8906, -2.2031, 0.3027, 1.4141, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9062, -2.7188, 1.3594, 0.4277, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2812, -1.9375, 2.1094, -0.3242, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:58:26,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.71 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 17:58:26,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.02 | bwd_microstep: 1.91 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.84 | step_microstep: 12.34
[2025-11-06 17:58:26,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 442.77 | bwd: 2.97 | bwd_inner: 1.90 | bwd_allreduce: 0.89 | step: 12.44
14%|█▍ | 504/3507 [13:40<1:10:43, 1.41s/it] {'loss': 0.2427, 'learning_rate': 1.9331770791416504e-05, 'epoch': 0.14}
tensor([[-3.7969, -3.3438, -1.2188, 1.3281, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0938, -3.3750, -0.4648, 0.9688, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0312, -2.0156, 1.2891, 1.0938, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 17:58:26,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.94 | bwd_microstep: 1.25 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11
tensor([[-3.7031, -2.9062, 0.1162, 0.7695, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[1.4531, 2.3281, 4.3438, 3.4219, 1.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.3438, -1.3516, 1.5625, 0.6680, -1.8672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.9062, -2.3594, -0.0366, 1.9141, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.8281, -3.2500, -0.6641, 1.4297, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 17:58:30,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.18 | optimizer_step: 0.27
[2025-11-06 17:58:30,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.85 | bwd_microstep: 3406.85 | bwd_inner_microstep: 8.65 | bwd_allreduce_microstep: 3398.08 | step_microstep: 2.19
[2025-11-06 17:58:30,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.81 | bwd: 3408.09 | bwd_inner: 9.71 | bwd_allreduce: 3398.17 | step: 2.30
14%|█▍ | 505/3507 [13:44<1:47:18, 2.14s/it] {'loss': 0.3566, 'learning_rate': 1.932844678231977e-05, 'epoch': 0.14}
tensor([[-2.3125, -1.7422, 0.4414, 1.9453, -1.7578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0938, -2.1719, 0.9766, 1.3906, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6875, -1.7656, 1.2578, 1.1953, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 17:58:30,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.24 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.7812, -2.0156, 0.6328, 0.9805, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.3750, -2.8438, -0.5352, 1.4141, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.9688, -2.4375, -0.3047, 1.3438, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.4375, -3.4219, 0.1162, -0.0806, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7500, -3.1719, -0.6953, 1.4531, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 17:58:30,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 17:58:30,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.58 | bwd_microstep: 30.24 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 29.13 | step_microstep: 1.77
[2025-11-06 17:58:30,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.85 | bwd: 31.04 | bwd_inner: 1.71 | bwd_allreduce: 29.17 | step: 1.83
14%|█▍ | 506/3507 [13:44<1:21:36, 1.63s/it] {'loss': 0.4603, 'learning_rate': 1.932511481353973e-05, 'epoch': 0.14}
tensor([[-3.0312, -2.3438, 0.2500, 1.1797, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.2969, -0.9648, 2.4062, 0.1260, -1.8672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 17:58:31,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.10 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.8750, -1.9453, 1.1719, 0.9648, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.1875, -2.3438,
0.7188, 1.2031, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.3438, -4.8438, -0.0674, -2.4375, -5.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1250, -1.9453, 1.4375, -0.1836, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4844, -2.3438, 1.3125, 0.5859, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.5625, -1.9062, 0.5742, 1.5781, -2.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:58:31,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 17:58:31,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.57 | bwd_microstep: 137.61 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 136.60 | step_microstep: 1.52 [2025-11-06 17:58:31,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 280.70 | bwd: 138.53 | bwd_inner: 1.77 | bwd_allreduce: 136.64 | step: 1.60 14%|█▍ | 507/3507 [13:45<1:03:50, 1.28s/it] {'loss': 0.7996, 'learning_rate': 1.9321774887919452e-05, 'epoch': 0.14} 14%|█▍ | 507/3507 [13:45<1:03:50, 1.28s/it]tensor([[-2.4844, -1.5000, 1.4609, 0.4668, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.7656, -1.3281, 0.4219, 2.3750, -1.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:58:31,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.77 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-6.0938, -5.2500, -1.7578, -0.3887, -5.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.4062, -1.0625, 2.9219, 0.2363, -1.9609]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7188, -1.4062, 2.3750, -0.2500, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2812, -3.5000, -0.4648, 0.8281, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.8438, -4.8125, -0.6094, 0.1543, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -2.8125, 1.3203, -0.3125, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 17:58:31,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:58:31,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.14 | bwd_microstep: 140.58 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 139.44 | step_microstep: 1.31 [2025-11-06 17:58:31,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.93 | bwd: 141.45 | bwd_inner: 1.85 | bwd_allreduce: 139.48 | step: 1.38 14%|█▍ | 508/3507 [13:45<52:37, 1.05s/it] {'loss': 0.4396, 'learning_rate': 1.9318427008308785e-05, 'epoch': 0.14} 14%|█▍ | 508/3507 [13:45<52:37, 1.05s/it]tensor([[-1.5156, -0.6406, 2.0469, 1.8516, -1.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7500, -3.0156, -0.0894, 0.8828, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3281, -1.2422, 1.9297, -0.0205, -1.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:58:32,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.83 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.18 tensor([[-3.7188, -3.0156, -0.2344, 0.8281, -3.0156]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5000, -2.4688, 0.9141, 0.7227, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0312, -3.7500, 0.4414, -0.8672, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.0781, -1.8828, 1.6406, 0.1719, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5000, -2.2969, 1.1719, -0.3672, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:58:34,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 17:58:34,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.62 | bwd_microstep: 1767.16 | bwd_inner_microstep: 1.68 | bwd_allreduce_microstep: 1765.39 | step_microstep: 1.67 [2025-11-06 17:58:34,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.48 | bwd: 1768.20 | bwd_inner: 2.64 | bwd_allreduce: 1765.43 | step: 1.85 15%|█▍ | 509/3507 [13:47<1:09:39, 1.39s/it] {'loss': 0.5026, 'learning_rate': 1.931507117756438e-05, 'epoch': 0.15} 15%|█▍ | 509/3507 [13:47<1:09:39, 1.39s/it]tensor([[-2.2500, -1.4844, 1.0781, 1.6406, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.6484, -0.6055, 2.5625, 1.2109, -1.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:58:34,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.09 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.1875, -3.5312, -0.8008, 0.8398, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4688, -2.5469, 0.8242, 1.3516, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-1.0391, 0.0388, 3.0156, 1.6484, -0.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.5625, -1.2656, 2.4062, 0.0464, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.7578, -0.8359, 2.2188, 2.2969, -1.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1719, -2.2031, 0.9453, 0.3418, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:58:34,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 17:58:34,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.02 | bwd_microstep: 234.91 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 233.85 | step_microstep: 1.59 [2025-11-06 17:58:34,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 279.15 | bwd: 235.77 | bwd_inner: 1.75 | bwd_allreduce: 233.89 | step: 1.67 15%|█▍ | 510/3507 [13:48<56:54, 1.14s/it] {'loss': 0.3997, 'learning_rate': 1.931170739854967e-05, 'epoch': 0.15} 15%|█▍ | 510/3507 [13:48<56:54, 1.14s/it]tensor([[-2.8906, -1.7422, 1.8594, 0.2012, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:58:34,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 116.54 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.8125, -4.7500, -0.7656, -0.7656, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3906, -2.0000, 2.2344, -0.4980, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9062, -2.6094, 1.6016, 0.0610, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6250, -2.5312, 1.3203, 0.7070, 
-2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8906, -2.8750, 0.6445, 0.6602, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0625, -2.9688, 0.9258, 0.6055, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.3750, -1.2109, 1.9531, -0.4355, -1.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:58:35,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 17:58:35,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 268.80 | bwd_microstep: 352.40 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 351.21 | step_microstep: 1.87 [2025-11-06 17:58:35,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.28 | bwd: 353.20 | bwd_inner: 1.82 | bwd_allreduce: 351.25 | step: 1.94 15%|█▍ | 511/3507 [13:49<53:15, 1.07s/it] {'loss': 0.4333, 'learning_rate': 1.930833567413486e-05, 'epoch': 0.15} 15%|█▍ | 511/3507 [13:49<53:15, 1.07s/it]tensor([[-4.1250, -3.1250, 0.3848, 0.5977, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:58:35,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.75 | bwd_microstep: 1.11 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.2031, -2.4219, 0.4902, 1.5312, -2.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.6719, -1.2969, 2.4375, -0.4668, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.9688, -2.9531, 0.7148, 0.6211, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9531, -1.7344, 2.0938, 0.0613, -2.4375]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3125, -3.6875, -0.8203, 1.2656, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0625, -3.1250, 0.2227, 0.4668, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2812, -3.0156, 1.2969, -0.0698, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:58:37,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:58:37,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.75 | bwd_microstep: 1581.96 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 1580.64 | step_microstep: 1.87 [2025-11-06 17:58:37,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.52 | bwd: 1583.06 | bwd_inner: 2.26 | bwd_allreduce: 1580.68 | step: 1.96 15%|█▍ | 512/3507 [13:51<1:07:10, 1.35s/it] {'loss': 0.8384, 'learning_rate': 1.9304956007196943e-05, 'epoch': 0.15} 15%|█▍ | 512/3507 [13:51<1:07:10, 1.35s/it]tensor([[-4.1562, -3.2500, 0.1084, 0.5000, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5625, -2.7344, 0.5117, 1.7891, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.7812, 0.2002, 2.9219, 1.7734, -0.4824]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1875, -3.2188, 0.4746, 0.7891, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:58:37,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 338.41 | bwd_microstep: 1.31 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.7656, -2.2188, -0.0430, 1.5469, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') 
tensor([[-3.5469, -3.1406, -1.2422, 1.2500, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.8594, -0.8906, 2.1094, 1.8438, -1.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.1719, -0.9453, 2.4531, 0.2852, -1.7578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:58:38,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 17:58:38,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.08 | bwd_microstep: 550.30 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 549.21 | step_microstep: 1.61 [2025-11-06 17:58:38,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 481.53 | bwd: 551.61 | bwd_inner: 2.23 | bwd_allreduce: 549.25 | step: 1.70 15%|█▍ | 513/3507 [13:52<1:03:04, 1.26s/it] {'loss': 0.3602, 'learning_rate': 1.9301568400619693e-05, 'epoch': 0.15} 15%|█▍ | 513/3507 [13:52<1:03:04, 1.26s/it]tensor([[-2.6562, -1.2734, 2.5000, -0.3281, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:58:38,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.71 | bwd_microstep: 1.18 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.1875, -3.3750, -0.2109, 1.0938, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8906, -2.1562, 0.5898, 1.8203, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2969, -2.3281, 1.0312, 1.2266, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0625, -2.5000, -0.1846, 1.6016, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.1094, -2.3906, 0.3496, 1.3750, 
-2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.6250, -5.4375, -1.2266, -1.6953, -5.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.3906, -1.2969, 1.9844, 0.9102, -1.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:58:40,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.28 [2025-11-06 17:58:40,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.26 | bwd_microstep: 1315.99 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1314.69 | step_microstep: 2.06 [2025-11-06 17:58:40,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 276.98 | bwd: 1317.17 | bwd_inner: 2.28 | bwd_allreduce: 1314.75 | step: 2.15 15%|█▍ | 514/3507 [13:54<1:08:26, 1.37s/it] {'loss': 0.3578, 'learning_rate': 1.929817285729364e-05, 'epoch': 0.15} 15%|█▍ | 514/3507 [13:54<1:08:26, 1.37s/it]tensor([[-4.4062, -3.1094, 1.1953, -0.2188, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8750, -2.3750, -0.2500, 1.9609, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-3.0000, -1.9375, 1.3594, 0.5078, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.2812, -1.4062, 1.3828, 1.3672, -1.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:58:40,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.04 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.1250, -3.9219, -0.0928, -1.0391, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0156, -2.2969, 0.3477, 1.3672, -2.3906]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1875, -2.9688, 1.1328, 0.1816, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5781, -2.8125, 0.0684, 1.2500, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:58:41,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.14 | optimizer_step: 0.14 [2025-11-06 17:58:41,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.00 | bwd_microstep: 897.34 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 896.12 | step_microstep: 1.50 [2025-11-06 17:58:41,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.06 | bwd: 898.34 | bwd_inner: 2.05 | bwd_allreduce: 896.16 | step: 1.59 15%|█▍ | 515/3507 [13:55<1:07:26, 1.35s/it] {'loss': 0.881, 'learning_rate': 1.9294769380116117e-05, 'epoch': 0.15} 15%|█▍ | 515/3507 [13:55<1:07:26, 1.35s/it]tensor([[-3.8125, -2.8125, 0.7031, 0.7266, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1719, -2.6562, -0.5820, 1.4219, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9688, -3.1250, 0.1309, 0.5156, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:58:41,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.70 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-1.6328, -0.3613, 2.9375, 0.1895, -1.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.4062, -1.5391, 1.1797, 0.9570, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[1.8281, 2.6250, 4.5625, 4.1875, 1.8672]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:2') tensor([[0.4004, 1.0391, 2.9219, 3.7031, 0.6523]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.3750, 0.4785, 3.0625, 3.0156, -0.0645]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:58:42,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 17:58:42,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 92.83 | bwd_microstep: 899.59 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 898.36 | step_microstep: 2.04 [2025-11-06 17:58:42,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 259.55 | bwd: 900.56 | bwd_inner: 2.02 | bwd_allreduce: 898.41 | step: 2.13 15%|█▍ | 516/3507 [13:56<1:04:59, 1.30s/it] {'loss': 0.6452, 'learning_rate': 1.9291357971991193e-05, 'epoch': 0.15} 15%|█▍ | 516/3507 [13:56<1:04:59, 1.30s/it]tensor([[-3.8281, -2.7344, 1.2344, 0.7773, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1562, -2.6562, -0.5664, 1.0156, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1562, -1.9766, 1.6016, 0.1904, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:58:42,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.29 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.1250, -2.1250, 1.1562, 0.6406, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7969, -1.5234, 2.2031, 0.1318, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6562, -3.8594, -0.6680, 0.4609, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5000, -3.5469, 
0.1523, 0.6914, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -3.7188, -0.9844, 0.5977, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:58:45,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.25 | optimizer_step: 0.26 [2025-11-06 17:58:45,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.59 | bwd_microstep: 2517.05 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 2515.91 | step_microstep: 3.00 [2025-11-06 17:58:45,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.92 | bwd: 2518.09 | bwd_inner: 1.98 | bwd_allreduce: 2515.95 | step: 3.08 15%|█▍ | 517/3507 [13:59<1:28:51, 1.78s/it] {'loss': 0.5769, 'learning_rate': 1.928793863582973e-05, 'epoch': 0.15} 15%|█▍ | 517/3507 [13:59<1:28:51, 1.78s/it]tensor([[-2.7656, -2.2188, 0.0654, 1.8516, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9375, -2.2656, 0.4102, 1.9219, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7344, -2.0625, 0.6172, 2.2500, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7500, -2.3906, 2.0312, 0.2041, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:58:45,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.13 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-2.7188, -1.5156, 1.6875, -0.1338, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-2.0781, -1.1797, 1.5312, 1.3047, -1.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6562, -1.8516, 0.9961, 1.9688, -2.0312]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8906, -2.6875, 1.3047, 0.3535, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 17:58:46,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:58:46,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.06 | bwd_microstep: 95.01 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 93.74 | step_microstep: 2.28 [2025-11-06 17:58:46,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 426.02 | bwd: 95.91 | bwd_inner: 2.00 | bwd_allreduce: 93.78 | step: 2.36 15%|█▍ | 518/3507 [13:59<1:10:33, 1.42s/it] {'loss': 0.7246, 'learning_rate': 1.9284511374549338e-05, 'epoch': 0.15} 15%|█▍ | 518/3507 [13:59<1:10:33, 1.42s/it]tensor([[-2.6719, -2.0469, 0.4922, 2.4062, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:58:46,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.35 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.5000, -2.0625, 2.1719, -0.6055, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4375, -4.7500, -1.6875, 0.3555, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9688, -3.0938, 0.2793, 0.9219, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5625, -1.3281, 2.2344, 0.6953, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.9688, -4.2188, -1.0938, 0.0947, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4531, -1.8047, 0.6484, 1.9141, -1.8828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') tensor([[-2.4062, -1.2188, 2.0156, -0.0698, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:58:47,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.36 | optimizer_step: 0.36 [2025-11-06 17:58:47,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.93 | bwd_microstep: 708.33 | bwd_inner_microstep: 1.59 | bwd_allreduce_microstep: 706.59 | step_microstep: 3.38 [2025-11-06 17:58:47,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.31 | bwd: 709.26 | bwd_inner: 2.42 | bwd_allreduce: 706.66 | step: 3.46 15%|█▍ | 519/3507 [14:01<1:04:54, 1.30s/it] {'loss': 0.4211, 'learning_rate': 1.92810761910744e-05, 'epoch': 0.15} 15%|█▍ | 519/3507 [14:01<1:04:54, 1.30s/it]tensor([[-3.2656, -2.4688, 0.3672, 1.0312, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0469, -2.4062, 0.0630, 1.5000, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7500, -2.2500, -0.2031, 1.3984, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5781, -2.2656, 1.8203, -0.3691, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:58:47,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 317.49 | bwd_microstep: 1.80 | bwd_inner_microstep: 1.59 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15 tensor([[-2.1094, -1.5938, 0.1953, 1.6484, -1.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-2.8594, -1.8438, 1.3281, 0.7461, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.4883, 0.1172, 1.9766, 2.8906, -0.1235]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1250, 
-3.2031, 0.2930, 1.1328, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:58:48,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 17:58:48,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.82 | bwd_microstep: 1149.44 | bwd_inner_microstep: 1.99 | bwd_allreduce_microstep: 1147.32 | step_microstep: 2.61 [2025-11-06 17:58:48,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 460.31 | bwd: 1151.23 | bwd_inner: 3.62 | bwd_allreduce: 1147.40 | step: 2.76 15%|█▍ | 520/3507 [14:02<1:10:13, 1.41s/it] {'loss': 0.8365, 'learning_rate': 1.9277633088336053e-05, 'epoch': 0.15} 15%|█▍ | 520/3507 [14:02<1:10:13, 1.41s/it]tensor([[-2.3906, -1.4219, 1.5781, 1.1172, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.4531, -2.4219, 0.8789, 0.2695, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0938, -4.0938, -0.3691, -0.1357, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9688, -1.6250, 2.1094, -0.7773, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:58:49,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 303.70 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.2188, -1.8281, 2.1094, -0.3379, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -3.0000, 0.5977, 0.4434, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.8906, -0.3789, 1.5625, 3.7344, -0.4141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9531, -2.6406, 1.5781, -0.2812, -3.2812]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 17:58:49,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.01 | optimizer_gradients: 0.25 | optimizer_step: 0.29
[2025-11-06 17:58:49,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.84 | bwd_microstep: 69.74 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 68.63 | step_microstep: 3.90
[2025-11-06 17:58:49,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 457.57 | bwd: 70.77 | bwd_inner: 1.97 | bwd_allreduce: 68.67 | step: 3.99
 15%|█▍ | 521/3507 [14:03<57:43, 1.16s/it] {'loss': 0.7735, 'learning_rate': 1.9274182069272194e-05, 'epoch': 0.15}
tensor([[0.1816, 0.9648, 3.2031, 3.3125, 0.4355]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.3750, -2.7188, -0.0879, 1.5312, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:58:49,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.69 | bwd_microstep: 1.81 | bwd_inner_microstep: 1.61 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15
tensor([[-3.7031, -2.6250, 1.1016, 0.4922, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.7031, -2.7969, 0.4395, 0.9883, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.4531, -1.3750, 1.5391, 0.1875, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.5000, -1.9688, 0.2295, 2.0312, -1.8828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.7500, -3.6719, 0.3906, 0.3926, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.0156, -2.3438, 0.2656, 1.8438, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[... the same pattern of per-rank logit/label debug prints and per-microstep timing records repeats at every step; only each step's timing totals and (deduplicated) progress line are kept below. The grad_fn class names were stripped during extraction and are not recoverable ...]
[2025-11-06 17:58:51,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.45 | bwd: 1468.20 | bwd_inner: 3.29 | bwd_allreduce: 1464.68 | step: 139.08
 15%|█▍ | 522/3507 [14:05<1:09:43, 1.40s/it] {'loss': 0.8163, 'learning_rate': 1.9270723136827478e-05, 'epoch': 0.15}
[2025-11-06 17:58:52,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.89 | bwd: 735.56 | bwd_inner: 1.90 | bwd_allreduce: 733.53 | step: 3.13
 15%|█▍ | 523/3507 [14:06<1:05:40, 1.32s/it] {'loss': 0.866, 'learning_rate': 1.9267256293953298e-05, 'epoch': 0.15}
[2025-11-06 17:58:54,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 191.95 | bwd: 2189.06 | bwd_inner: 2.20 | bwd_allreduce: 2186.70 | step: 4.60
 15%|█▍ | 524/3507 [14:08<1:22:00, 1.65s/it] {'loss': 0.7026, 'learning_rate': 1.9263781543607817e-05, 'epoch': 0.15}
[2025-11-06 17:58:55,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 286.22 | bwd: 77.54 | bwd_inner: 2.16 | bwd_allreduce: 75.25 | step: 3.19
 15%|█▍ | 525/3507 [14:09<1:03:21, 1.27s/it] {'loss': 0.5753, 'learning_rate': 1.9260298888755927e-05, 'epoch': 0.15}
[2025-11-06 17:58:56,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.94 | bwd: 767.23 | bwd_inner: 3.58 | bwd_allreduce: 763.44 | step: 2.70
 15%|█▍ | 526/3507 [14:10<1:01:02, 1.23s/it] {'loss': 0.576, 'learning_rate': 1.9256808332369278e-05, 'epoch': 0.15}
[2025-11-06 17:58:57,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.16 | bwd: 168.12 | bwd_inner: 1.72 | bwd_allreduce: 166.29 | step: 2.97
 15%|█▌ | 527/3507 [14:10<51:29, 1.04s/it] {'loss': 0.7861, 'learning_rate': 1.9253309877426257e-05, 'epoch': 0.15}
[2025-11-06 17:58:59,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.21 | bwd: 2233.71 | bwd_inner: 1.86 | bwd_allreduce: 2231.72 | step: 2.37
 15%|█▌ | 528/3507 [14:13<1:14:35, 1.50s/it] {'loss': 1.2859, 'learning_rate': 1.9249803526911988e-05, 'epoch': 0.15}
[2025-11-06 17:59:00,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 295.69 | bwd: 82.10 | bwd_inner: 1.89 | bwd_allreduce: 80.08 | step: 2.79
 15%|█▌ | 529/3507 [14:13<58:21, 1.18s/it] {'loss': 1.0697, 'learning_rate': 1.9246289283818334e-05, 'epoch': 0.15}
[2025-11-06 17:59:02,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 275.83 | bwd: 2012.27 | bwd_inner: 1.70 | bwd_allreduce: 2010.44 | step: 2.75
 15%|█▌ | 530/3507 [14:16<1:15:22, 1.52s/it] {'loss': 0.5551, 'learning_rate': 1.9242767151143896e-05, 'epoch': 0.15}
[2025-11-06 17:59:02,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.34 | bwd: 173.93 | bwd_inner: 1.68 | bwd_allreduce: 172.12 | step: 2.85
 15%|█▌ | 531/3507 [14:16<1:00:51, 1.23s/it] {'loss': 0.8229, 'learning_rate': 1.9239237131894e-05, 'epoch': 0.15}
[2025-11-06 17:59:03,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.11 | bwd: 595.12 | bwd_inner: 1.97 | bwd_allreduce: 593.03 | step: 2.37
 15%|█▌ | 532/3507 [14:17<56:28, 1.14s/it] {'loss': 0.8806, 'learning_rate': 1.923569922908071e-05, 'epoch': 0.15}
[2025-11-06 17:59:04,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.26 | bwd: 253.90 | bwd_inner: 3.58 | bwd_allreduce: 250.12 | step: 2.46
 15%|█▌ | 533/3507 [14:18<53:27, 1.08s/it] {'loss': 0.6726, 'learning_rate': 1.923215344572281e-05, 'epoch': 0.15}
[2025-11-06 17:59:06,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 484.64 | bwd: 1529.83 | bwd_inner: 3.40 | bwd_allreduce: 1526.18 | step: 5.41
 15%|█▌ | 534/3507 [14:20<1:08:09, 1.38s/it] {'loss': 0.4244, 'learning_rate': 1.922859978484581e-05, 'epoch': 0.15}
[2025-11-06 17:59:07,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 293.52 | bwd: 91.82 | bwd_inner: 1.94 | bwd_allreduce: 89.73 | step: 3.18
 15%|█▌ | 535/3507 [14:21<53:57, 1.09s/it] {'loss': 0.9734, 'learning_rate': 1.922503824948194e-05, 'epoch': 0.15}
[2025-11-06 17:59:10,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.31 | bwd: 2753.51 | bwd_inner: 2.19 | bwd_allreduce: 2751.19 | step: 2.91
 15%|█▌ | 536/3507 [14:24<1:24:50, 1.71s/it] {'loss': 0.428, 'learning_rate': 1.9221468842670156e-05, 'epoch': 0.15}
[2025-11-06 17:59:10,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.95 | bwd: 79.60 | bwd_inner: 3.14 | bwd_allreduce: 76.24 | step: 2.28
 15%|█▌ | 537/3507 [14:24<1:06:36, 1.35s/it] {'loss': 0.2983, 'learning_rate': 1.9217891567456123e-05, 'epoch': 0.15}
[2025-11-06 17:59:12,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.21 | bwd: 1302.99 | bwd_inner: 1.92 | bwd_allreduce: 1300.95 | step: 3.11
 15%|█▌ | 538/3507 [14:26<1:11:57, 1.45s/it] {'loss': 0.7286, 'learning_rate': 1.921430642689222e-05, 'epoch': 0.15}
[2025-11-06 17:59:13,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 475.63 | bwd: 3.12 | bwd_inner: 2.01 | bwd_allreduce: 0.97 | step: 4.41
 15%|█▌ | 539/3507 [14:27<1:05:00, 1.31s/it] {'loss': 0.4349, 'learning_rate': 1.9210713424037546e-05, 'epoch': 0.15}
[2025-11-06 17:59:15,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 253.12 | bwd: 2014.47 | bwd_inner: 1.95 | bwd_allreduce: 2012.39 | step: 3.03
 15%|█▌ | 540/3507 [14:29<1:19:37, 1.61s/it] {'loss': 0.2996, 'learning_rate': 1.9207112561957894e-05, 'epoch': 0.15}
[2025-11-06 17:59:16,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.58 | bwd: 2.99 | bwd_inner: 2.09 | bwd_allreduce: 0.78 | step: 2.58
 15%|█▌ | 541/3507 [14:30<1:03:28, 1.28s/it] {'loss': 0.3784, 'learning_rate': 1.920350384372578e-05, 'epoch': 0.15}
[2025-11-06 17:59:18,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 396.62 | bwd: 1830.62 | bwd_inner: 2.24 | bwd_allreduce: 1828.26 | step: 3.48
 15%|█▌ | 542/3507 [14:32<1:18:04, 1.58s/it] {'loss': 0.439, 'learning_rate': 1.919988727242041e-05, 'epoch': 0.15}
 15%|█▌ | 542/3507 [14:32<1:18:04,
1.58s/it]tensor([[-5.8750, -4.6562, -0.8086, -1.0469, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8125, -1.6562, 1.7109, 0.9922, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9375, -3.2656, -0.7188, 1.6094, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7500, -2.5312, 1.1328, -0.0449, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4688, -2.6406, 0.1924, 1.4688, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6250, -1.2891, 2.2500, 0.3496, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5156, -1.9766, -0.1172, 1.1953, -1.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:59:20,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.82 | bwd_microstep: 2.10 | bwd_inner_microstep: 1.80 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.19 tensor([[-3.5000, -2.9062, -0.5703, 1.7109, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:59:20,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 17:59:20,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.87 | bwd_microstep: 1.74 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.81 | step_microstep: 2.64 [2025-11-06 17:59:20,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.75 | bwd: 3.83 | bwd_inner: 2.70 | bwd_allreduce: 0.90 | step: 2.83 15%|█▌ | 543/3507 [14:34<1:18:44, 1.59s/it] {'loss': 0.296, 'learning_rate': 1.9196262851127695e-05, 'epoch': 0.15} 15%|█▌ | 543/3507 [14:34<1:18:44, 1.59s/it]tensor([[-4.4375, -3.4531, -0.2373, 
0.9727, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2812, -1.5078, 1.0156, 2.3906, -1.6016]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:59:20,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.29 | bwd_microstep: 1.21 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.6562, -1.4922, 1.6641, 0.9570, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6562, -3.0625, -0.7070, 1.5938, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7500, -1.3750, 2.0625, -0.5664, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.6719, -1.2266, 2.2656, -0.5352, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.8125, -2.1250, 0.1953, 1.6250, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3125, -3.3125, 0.1914, 1.1250, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:59:21,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.68 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 17:59:21,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.30 | bwd_microstep: 414.21 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 413.16 | step_microstep: 3.23 [2025-11-06 17:59:21,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 413.61 | bwd: 415.41 | bwd_inner: 2.08 | bwd_allreduce: 413.19 | step: 3.30 16%|█▌ | 544/3507 [14:35<1:08:03, 1.38s/it] {'loss': 0.7639, 'learning_rate': 1.9192630582940243e-05, 'epoch': 0.16} 16%|█▌ | 544/3507 [14:35<1:08:03, 1.38s/it]tensor([[-1.6875, -1.0312, 0.9648, 1.7578, -1.1719]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2188, -2.1094, 1.3984, 1.2344, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6406, -1.2109, 2.3750, -0.4102, -2.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.4531, -1.0469, 2.4688, -0.1777, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5156, -2.7188, 0.1230, 1.4219, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6406, -2.0938, -0.1543, 1.8438, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-1.7578, -0.8516, 1.7188, 1.8828, -1.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:59:22,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.86 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.7500, -2.6250, 0.7344, 0.7383, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:59:23,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.14 | optimizer_step: 0.19 [2025-11-06 17:59:23,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.44 | bwd_microstep: 1.96 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.91 | step_microstep: 2.03 [2025-11-06 17:59:23,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.32 | bwd: 2.98 | bwd_inner: 1.90 | bwd_allreduce: 0.94 | step: 2.11 16%|█▌ | 545/3507 [14:36<1:15:37, 1.53s/it] {'loss': 0.9631, 'learning_rate': 1.918899047095737e-05, 'epoch': 0.16} 16%|█▌ | 545/3507 [14:36<1:15:37, 1.53s/it]tensor([[-1.7500, -0.8398, 1.5938, 1.1016, -1.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:0') [2025-11-06 17:59:23,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.78 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.5391, -0.4219, 2.4688, 1.4453, -1.0859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1562, -2.6250, -0.7070, 1.6094, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -3.8906, 0.4277, 0.2021, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.8125, -3.6406, 0.1235, 0.3926, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9531, -2.6406, 1.4375, 0.4316, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0469, -2.0781, 0.9805, 2.0469, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6094, -2.7969, 0.0347, 1.4375, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:59:24,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.20 | optimizer_step: 0.26 [2025-11-06 17:59:24,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.74 | bwd_microstep: 681.39 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 680.31 | step_microstep: 1.91 [2025-11-06 17:59:24,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 282.54 | bwd: 682.17 | bwd_inner: 1.69 | bwd_allreduce: 680.36 | step: 1.99 16%|█▌ | 546/3507 [14:37<1:07:38, 1.37s/it] {'loss': 0.3987, 'learning_rate': 1.918534251828506e-05, 'epoch': 0.16} 16%|█▌ | 546/3507 [14:37<1:07:38, 1.37s/it]tensor([[-3.3125, -2.8125, -0.8672, 1.3281, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2500, -4.1875, 
-0.4453, 0.7070, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2188, -4.0625, -0.3145, -0.6719, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8438, -1.9688, 0.8672, 1.6797, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9688, -1.8828, 1.4453, 1.2891, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4688, -3.2031, 0.7695, -0.1357, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.0222, 0.8828, 3.0156, 2.6250, 0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:59:25,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.60 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.3438, -3.4375, -0.3477, 0.6836, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:59:25,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 17:59:25,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.23 | bwd_microstep: 1.83 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.78 | step_microstep: 2.07 [2025-11-06 17:59:25,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.83 | bwd: 2.72 | bwd_inner: 1.76 | bwd_allreduce: 0.82 | step: 2.16 16%|█▌ | 547/3507 [14:39<1:04:30, 1.31s/it] {'loss': 0.4904, 'learning_rate': 1.918168672803601e-05, 'epoch': 0.16} 16%|█▌ | 547/3507 [14:39<1:04:30, 1.31s/it]tensor([[-1.2188, -0.6133, 1.2891, 2.5156, -0.7148]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:59:25,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 174.81 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.8438, -0.8164, 1.9062, 1.3047, -1.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8750, -2.8438, 0.3730, 0.6133, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -4.2500, -0.2266, 0.3828, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.6875, -1.3516, 1.7578, -0.7031, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.2031, -1.9609, 1.3203, 0.1465, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.6562, -3.1719, 0.9453, -1.3203, -3.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3594, -1.9531, 1.8672, 0.1172, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:59:25,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 17:59:25,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.78 | bwd_microstep: 340.86 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 339.71 | step_microstep: 1.65 [2025-11-06 17:59:25,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.62 | bwd: 341.74 | bwd_inner: 1.88 | bwd_allreduce: 339.74 | step: 1.72 16%|█▌ | 548/3507 [14:39<55:11, 1.12s/it] {'loss': 1.1387, 'learning_rate': 1.9178023103329595e-05, 'epoch': 0.16} 16%|█▌ | 548/3507 [14:39<55:11, 1.12s/it]tensor([[-3.6406, -2.7031, 0.3535, 1.3828, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3594, -2.6094, 0.0439, 1.6406, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:1') tensor([[-2.8438, -1.8594, 1.0078, 1.0156, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8281, -1.7109, 1.4531, 0.5000, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.5391, -0.7852, 1.3594, 1.8203, -1.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1719, -0.7305, 2.6562, -0.4922, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.1250, -1.0703, 1.8359, 1.7266, -1.5234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:59:27,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.36 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.7500, -4.1250, -1.8516, 0.1533, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:59:27,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:59:27,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.09 | bwd_microstep: 2.15 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.16 [2025-11-06 17:59:27,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.46 | bwd: 3.10 | bwd_inner: 2.10 | bwd_allreduce: 0.86 | step: 2.26 16%|█▌ | 549/3507 [14:41<1:04:02, 1.30s/it] {'loss': 0.8257, 'learning_rate': 1.917435164729187e-05, 'epoch': 0.16} 16%|█▌ | 549/3507 [14:41<1:04:02, 1.30s/it]tensor([[-3.0938, -1.6719, 2.1250, -0.1348, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6562, -1.8984, 0.6172, 2.2500, -1.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7500, -1.7344, 1.1094, 
0.5430, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 17:59:27,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.67 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.0312, -2.1719, 0.6523, 1.8828, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -2.8281, 0.8516, 0.3047, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8594, -3.0781, -0.4121, 1.1484, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.6719, -0.6953, 1.6953, 1.0234, -1.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.9688, -1.6328, 2.0625, 0.3262, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 17:59:28,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.19 | optimizer_step: 0.23 [2025-11-06 17:59:28,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.01 | bwd_microstep: 59.53 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 58.43 | step_microstep: 2.21 [2025-11-06 17:59:28,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.67 | bwd: 60.37 | bwd_inner: 1.75 | bwd_allreduce: 58.47 | step: 2.29 16%|█▌ | 550/3507 [14:41<51:13, 1.04s/it] {'loss': 0.7617, 'learning_rate': 1.917067236305559e-05, 'epoch': 0.16} 16%|█▌ | 550/3507 [14:41<51:13, 1.04s/it]tensor([[-2.0469, -1.2578, 1.0938, 2.0469, -1.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2656, -2.4531, 0.2051, 1.2344, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5781, -2.0625, -0.1226, 2.0469, -1.8281]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5625, -2.5156, 0.5977, 0.4746, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6562, -2.1719, -0.4336, 1.9922, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5625, -3.2188, 0.9336, 0.3867, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:59:29,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.39 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.9688, -2.5938, 1.3281, -0.2598, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5625, -4.0938, -0.0786, -1.4609, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 17:59:29,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.22 | optimizer_step: 0.22 [2025-11-06 17:59:29,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.46 | bwd_microstep: 2.28 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1.07 | step_microstep: 1.99 [2025-11-06 17:59:29,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.86 | bwd: 3.32 | bwd_inner: 2.04 | bwd_allreduce: 1.12 | step: 2.09 16%|█▌ | 551/3507 [14:43<1:03:37, 1.29s/it] {'loss': 1.0757, 'learning_rate': 1.9166985253760165e-05, 'epoch': 0.16} 16%|█▌ | 551/3507 [14:43<1:03:37, 1.29s/it]tensor([[-3.7031, -2.3125, 1.5625, -0.1279, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2500, -2.1562, 0.9570, 0.7891, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:59:30,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.77 | bwd_microstep: 
0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.2031, -1.6250, 0.2344, 2.2969, -1.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8125, -2.5312, 0.9961, 0.0806, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0000, -2.4844, -0.6094, 1.6484, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.2344, -2.2969, 0.5820, 1.0781, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.6562, -1.2031, 2.6406, 0.3223, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5312, -1.3359, 1.7109, 0.6914, -1.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') [2025-11-06 17:59:31,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 17:59:31,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.27 | bwd_microstep: 1064.28 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1063.17 | step_microstep: 1.86 [2025-11-06 17:59:31,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.06 | bwd: 1065.22 | bwd_inner: 1.89 | bwd_allreduce: 1063.21 | step: 1.94 16%|█▌ | 552/3507 [14:45<1:06:24, 1.35s/it] {'loss': 0.9746, 'learning_rate': 1.9163290322551704e-05, 'epoch': 0.16} 16%|█▌ | 552/3507 [14:45<1:06:24, 1.35s/it]tensor([[-3.4531, -2.8125, -0.5625, 1.7266, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:59:31,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.74 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.1094, -1.8438, 1.5625, 0.6406, -2.4375]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6406, -2.8438, -0.2754, 1.2500, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5938, -3.1719, 1.0000, 0.3262, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0938, -3.2031, -0.2432, 1.4062, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.5625, -2.0312, 1.9609, -0.4395, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7500, -2.8594, -0.0415, 1.1875, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2500, -2.0156, 1.4844, 0.8203, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 17:59:31,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 17:59:31,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.61 | bwd_microstep: 73.59 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 72.43 | step_microstep: 1.50 [2025-11-06 17:59:31,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.37 | bwd: 74.65 | bwd_inner: 2.06 | bwd_allreduce: 72.47 | step: 1.58 16%|█▌ | 553/3507 [14:45<52:18, 1.06s/it] {'loss': 0.2766, 'learning_rate': 1.9159587572582973e-05, 'epoch': 0.16} 16%|█▌ | 553/3507 [14:45<52:18, 1.06s/it]tensor([[-3.3906, -1.8047, 2.1719, -0.7422, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7500, -3.6406, -0.3164, -0.2402, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:59:32,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.48 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | 
step_microstep: 0.08 tensor([[-2.7500, -1.2188, 2.6562, -0.0277, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.1719, 0.0361, 2.8438, 1.3359, -0.8086]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -2.7344, 1.6953, -0.3027, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4219, -0.9922, 2.1250, -0.0194, -1.9453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2188, -2.2344, 0.6328, 0.6562, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3125, -4.4062, -1.3203, 0.3105, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 17:59:34,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 17:59:34,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.56 | bwd_microstep: 2477.35 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 2476.23 | step_microstep: 2.35 [2025-11-06 17:59:34,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.08 | bwd: 2478.22 | bwd_inner: 1.80 | bwd_allreduce: 2476.27 | step: 2.43 16%|█▌ | 554/3507 [14:48<1:18:13, 1.59s/it] {'loss': 0.4918, 'learning_rate': 1.9155877007013424e-05, 'epoch': 0.16} 16%|█▌ | 554/3507 [14:48<1:18:13, 1.59s/it]tensor([[-4.5000, -3.3750, 0.0840, 0.1216, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3281, -2.6562, -0.3809, 1.6641, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8594, -2.2656, -0.2969, 1.5859, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -3.9375, -0.2871, -0.2832, -4.1875]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:59:34,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.09 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-2.1406, -1.3750, 0.7930, 1.6562, -1.5391]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3906, -2.4375, 0.4766, 1.2891, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.3438, -1.7812, 0.1230, 2.2969, -1.5859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1250, -2.8438, 0.8633, 0.5156, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 17:59:35,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 17:59:35,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.04 | bwd_microstep: 1.97 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.86 | step_microstep: 1.93 [2025-11-06 17:59:35,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 485.16 | bwd: 2.89 | bwd_inner: 1.83 | bwd_allreduce: 0.91 | step: 2.03 16%|█▌ | 555/3507 [14:49<1:02:37, 1.27s/it] {'loss': 0.4626, 'learning_rate': 1.9152158629009168e-05, 'epoch': 0.16} 16%|█▌ | 555/3507 [14:49<1:02:37, 1.27s/it]tensor([[-3.5312, -2.0625, 1.7266, -0.1289, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 17:59:35,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.53 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.0469, -1.5234, 2.3438, -0.2578, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3438, -2.4844, 0.2100, 1.5625, 
-2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.9062, -2.1250, 0.3984, 2.1094, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.0000, -3.4375, 0.8750, -0.8164, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-0.1709, 0.2383, 1.2891, 2.9375, 0.2451]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:2')
tensor([[-3.5156, -2.7344, -0.1670, 1.3672, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.4531, -1.5391, 1.0000, 1.6172, -1.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 17:59:37,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.29
[2025-11-06 17:59:37,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.53 | bwd_microstep: 1542.07 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 1540.81 | step_microstep: 2.10
[2025-11-06 17:59:37,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 255.08 | bwd: 1542.95 | bwd_inner: 1.95 | bwd_allreduce: 1540.85 | step: 2.19
16%|█▌ | 556/3507 [14:50<1:10:47, 1.44s/it] {'loss': 0.5827, 'learning_rate': 1.9148432441742985e-05, 'epoch': 0.16}
tensor([[-4.6875, -3.3906, 0.5312, 0.5391, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[1.0078, 1.8047, 3.6719, 4.2812, 1.2422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:59:37,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.67 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-3.9375, -2.5625, 1.3750, 0.3438, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-1.8281, -0.5820, 2.2031, 0.7930, -1.3672]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.1562, -2.7500, 1.3359, 0.3516, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.8125, -4.2500, -2.0781, 0.3750, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.2500, -1.7578, 2.0312, 0.1855, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.5312, -4.1875, 0.0393, -0.0879, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 17:59:37,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 17:59:37,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.65 | bwd_microstep: 79.33 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 78.18 | step_microstep: 1.40
[2025-11-06 17:59:37,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.35 | bwd: 80.31 | bwd_inner: 1.96 | bwd_allreduce: 78.22 | step: 1.50
16%|█▌ | 557/3507 [14:51<55:41, 1.13s/it] {'loss': 0.4862, 'learning_rate': 1.914469844839432e-05, 'epoch': 0.16}
tensor([[-2.8594, -1.3203, 2.1406, -0.2197, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:59:37,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.46 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.4062, -3.5781, -0.8242, 0.6836, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.1562, -2.0469, 0.9062, 1.0391, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.2188, -3.4375, -0.8047, 0.9531, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.1406, -2.6562, -0.9375, 1.3281, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.5781, -2.2344, 1.4297, 0.5352, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.6562, -4.8438, -1.9766, 0.2070, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.8906, -1.4141, 2.0156, 0.0074, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 17:59:38,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 17:59:38,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.56 | bwd_microstep: 667.10 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 665.68 | step_microstep: 1.71
[2025-11-06 17:59:38,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.06 | bwd: 668.07 | bwd_inner: 2.19 | bwd_allreduce: 665.71 | step: 1.79
16%|█▌ | 558/3507 [14:52<54:16, 1.10s/it] {'loss': 0.2722, 'learning_rate': 1.9140956652149275e-05, 'epoch': 0.16}
tensor([[-3.5938, -2.5625, 0.4414, 0.8516, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:59:38,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 85.02 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.7031, -1.6719, 1.2031, 1.9609, -1.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.0625, -2.0938, 0.6641, 1.3750, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.6250, -2.1562, 1.4688, -0.3379, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.2188, -2.1875, 0.6953, 0.9727, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.3281, -1.3281, 1.4688, 1.5938, -1.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-1.9844, -1.1172, 1.3984, 2.4375, -1.3359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.4688, -1.1250, 2.0156, 0.1768, -1.9453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 17:59:39,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.24 | optimizer_step: 0.26
[2025-11-06 17:59:39,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.26 | bwd_microstep: 927.14 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 925.93 | step_microstep: 2.31
[2025-11-06 17:59:39,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 266.30 | bwd: 928.06 | bwd_inner: 1.94 | bwd_allreduce: 925.98 | step: 2.39
16%|█▌ | 559/3507 [14:53<56:03, 1.14s/it] {'loss': 0.4496, 'learning_rate': 1.9137207056200612e-05, 'epoch': 0.16}
tensor([[-5.5312, -4.5312, -1.2266, -0.1143, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.7188, -2.4688, 0.7266, 0.0474, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:59:39,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.05 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.3906, -1.9922, 1.3906, -0.0864, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.4062, -1.8672, 2.0156, -0.5938, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.0312, -2.5312, -0.7656, 1.7578, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:1')
tensor([[-3.0000, -1.9766, 1.0312, 1.6094, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.3906, -1.9531, 1.4141, -0.4297, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.5312, -3.7031, -0.9609, 0.5781, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 17:59:41,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:59:41,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.45 | bwd_microstep: 1177.68 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1176.47 | step_microstep: 2.08
[2025-11-06 17:59:41,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.52 | bwd: 1178.67 | bwd_inner: 2.04 | bwd_allreduce: 1176.51 | step: 2.16
16%|█▌ | 560/3507 [14:55<1:01:29, 1.25s/it] {'loss': 0.9, 'learning_rate': 1.9133449663747753e-05, 'epoch': 0.16}
tensor([[-3.3438, -1.7891, 1.9453, -0.2422, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-0.9414, 0.4746, 3.5156, 0.9180, -0.6680]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.1875, -2.1250, 0.9180, 1.0938, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:59:41,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.60 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.5469, -2.5156, 0.5742, 1.1875, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.6094, -1.7734, 0.5625, 1.2109, -1.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.8281, -2.0312, 0.3652, 1.5781, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.5781, -2.7969, -0.2100, 1.5625, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-2.5312, -1.5156, 1.3438, 1.7578, -1.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 17:59:41,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 17:59:41,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.29 | bwd_microstep: 112.71 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 111.62 | step_microstep: 1.43
[2025-11-06 17:59:41,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.92 | bwd: 113.80 | bwd_inner: 2.01 | bwd_allreduce: 111.66 | step: 1.52
16%|█▌ | 561/3507 [14:55<50:43, 1.03s/it] {'loss': 0.4193, 'learning_rate': 1.9129684477996762e-05, 'epoch': 0.16}
tensor([[-4.3750, -3.3906, -0.3125, 0.9531, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.4531, -2.6875, -0.1748, 1.3828, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:59:41,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.19 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.9844, -1.5312, 2.0312, -0.1475, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.6719, -2.9062, -0.4668, 0.5898, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.3906, -1.4844, 1.0703, 1.8438, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.0000, -0.7031, 2.0625, -0.2559, -1.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.3750, -2.1406, 1.2344, 0.5703, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.0625, -2.0469, 0.8555, 1.1641, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 17:59:44,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 17:59:44,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.95 | bwd_microstep: 2304.11 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 2303.08 | step_microstep: 2.07
[2025-11-06 17:59:44,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.16 | bwd: 2304.86 | bwd_inner: 1.62 | bwd_allreduce: 2303.12 | step: 2.15
16%|█▌ | 562/3507 [14:58<1:14:50, 1.52s/it] {'loss': 0.759, 'learning_rate': 1.9125911502160365e-05, 'epoch': 0.16}
tensor([[-2.3594, -1.0312, 2.3594, 0.5820, -1.8359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.8438, -2.2500, -0.2559, 2.0938, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-0.8945, 0.1885, 2.7656, 2.2969, -0.4785]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.6406, -2.0625, -0.1089, 2.2656, -1.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:59:44,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.47 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.7969, -2.1094, 0.0859, 1.7109, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-1.6641, -0.2324, 2.6719, 0.2539, -1.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.1875, -3.0781, 0.3965, 1.2109, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.9062, -2.6250, 0.7969, 0.6211, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 17:59:45,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:59:45,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.47 | bwd_microstep: 138.46 | bwd_inner_microstep: 1.42 | bwd_allreduce_microstep: 136.96 | step_microstep: 2.16
[2025-11-06 17:59:45,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.96 | bwd: 139.26 | bwd_inner: 2.15 | bwd_allreduce: 136.99 | step: 2.23
16%|█▌ | 563/3507 [14:59<1:05:27, 1.33s/it] {'loss': 0.3636, 'learning_rate': 1.9122130739457926e-05, 'epoch': 0.16}
tensor([[-2.7344, -2.1094, -0.2061, 1.6250, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:59:45,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.21 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-3.4844, -2.8125, -0.5781, 1.5859, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[0.4570, 1.6953, 4.5938, 3.0312, 0.6758]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.8594, -1.8828, 0.8125, 1.0469, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.1719, -1.8672, 1.1562, -0.2598, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.2812, -1.6875, 2.2031, -0.5938, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.2656, -2.5312, -0.0898, 1.8516, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.4062, -4.4375, -1.2266, -0.1602, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 17:59:48,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.29
[2025-11-06 17:59:48,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.66 | bwd_microstep: 2842.56 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 2841.49 | step_microstep: 1.99
[2025-11-06 17:59:48,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.90 | bwd: 2843.48 | bwd_inner: 1.80 | bwd_allreduce: 2841.53 | step: 2.09
16%|█▌ | 564/3507 [15:02<1:33:15, 1.90s/it] {'loss': 0.5067, 'learning_rate': 1.9118342193115456e-05, 'epoch': 0.16}
tensor([[-3.6562, -2.8281, -0.2256, 1.3047, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[0.2061, 0.5742, 1.4375, 3.1719, 0.6211]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:59:48,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.90 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.9219, -2.2500, -0.0649, 1.7344, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.1406, -2.2969, 0.2139, 1.4609, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.9375, -2.7188, 0.6367, 0.3457, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.2656, -2.4531, -0.0520, 0.6172, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.0000, -1.9844, 0.8828, 1.1953, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.2969, -1.8047, 1.9531, -0.1543, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 17:59:49,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.12 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 17:59:49,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.11 | bwd_microstep: 175.01 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 173.98 | step_microstep: 2.92
[2025-11-06 17:59:49,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.03 | bwd: 175.84 | bwd_inner: 1.71 | bwd_allreduce: 174.01 | step: 3.00
16%|█▌ | 565/3507 [15:02<1:12:46, 1.48s/it] {'loss': 0.3976, 'learning_rate': 1.9114545866365608e-05, 'epoch': 0.16}
tensor([[-2.7500, -1.2578, 2.5781, 0.1309, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.8125, -3.4531, 0.5430, -0.0193, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:59:49,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.89 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.0781, -1.8125, 1.1875, -0.3516, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.1406, -2.2969, 0.3281, 1.8828, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.5938, -3.0000, 1.2188, -0.8164, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.2188, -1.8594, 1.7812, 0.4727, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-0.8945, -0.3984, 0.9062, 2.7344, -0.3398]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.2500, -2.8281, 1.1953, 0.0143, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 17:59:49,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 17:59:49,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.53 | bwd_microstep: 414.58 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 413.42 | step_microstep: 1.95
[2025-11-06 17:59:49,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.44 | bwd: 415.39 | bwd_inner: 1.82 | bwd_allreduce: 413.45 | step: 2.02
16%|█▌ | 566/3507 [15:03<1:02:38, 1.28s/it] {'loss': 0.5597, 'learning_rate': 1.9110741762447673e-05, 'epoch': 0.16}
tensor([[-3.3281, -2.0781, 1.2500, 0.5742, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.8281, -1.9062, 0.7188, 1.4688, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:59:50,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.03 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.17
tensor([[-2.7188, -1.9844, 0.3496, 2.4219, -1.8984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.8594, -1.9922, 0.4785, 1.5391, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.2812, -1.1875, 1.4922, 0.8516, -1.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.0781, -2.3750, -0.0177, 2.2188, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.8125, -2.0469, 0.2041, 1.3281, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.9375, -2.4062, -0.6914, 1.2812, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 17:59:50,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.25 | optimizer_step: 0.24
[2025-11-06 17:59:50,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.95 | bwd_microstep: 617.44 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 616.38 | step_microstep: 2.28
[2025-11-06 17:59:50,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.01 | bwd: 618.34 | bwd_inner: 1.78 | bwd_allreduce: 616.42 | step: 2.45
16%|█▌ | 567/3507 [15:04<58:24, 1.19s/it] {'loss': 0.4038, 'learning_rate': 1.910692988460758e-05, 'epoch': 0.16}
tensor([[0.7109, 1.3281, 2.6875, 3.5625, 0.9922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.3438, -2.1719, 1.1406, 0.7422, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.3438, -3.1094, 0.6094, 0.8047, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:59:51,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.40 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.8906, -1.6875, 1.6406, 1.1797, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.3125, -4.1250, -0.3945, 0.3008, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.2969, -2.2188, 0.7344, 0.8984, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-2.6250, -1.3594, 1.6797, 0.0192, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[0.7656, 1.1953, 2.1719, 4.0000, 1.1484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 17:59:53,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:59:53,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.01 | bwd_microstep: 1952.30 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1951.12 | step_microstep: 1.79
[2025-11-06 17:59:53,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.42 | bwd: 1953.35 | bwd_inner: 2.07 | bwd_allreduce: 1951.16 | step: 1.87
16%|█▌ | 568/3507 [15:06<1:14:39, 1.52s/it] {'loss': 0.5552, 'learning_rate': 1.9103110236097885e-05, 'epoch': 0.16}
tensor([[-3.9062, -2.8125, 0.4434, 0.6758, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.5000, -1.5000, 1.3125, 2.0000, -1.7891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.6094, -2.6406, 0.3184, 1.6016, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:59:53,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.66 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-3.9531, -3.1875, -0.5820, 1.2500, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.5000, -2.0312, 1.6797, -0.3359, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.5156, -2.4688, 0.6016, 1.0234, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.8750, -4.5625, -0.5547, -0.3809, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.2188, -2.1875, 0.7344, 0.7305, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 17:59:54,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 17:59:54,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.48 | bwd_microstep: 1027.87 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1026.80 | step_microstep: 1.74
[2025-11-06 17:59:54,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.16 | bwd: 1028.87 | bwd_inner: 1.91 | bwd_allreduce: 1026.83 | step: 1.81
16%|█▌ | 569/3507 [15:08<1:12:39, 1.48s/it] {'loss': 0.5312, 'learning_rate': 1.909928282017779e-05, 'epoch': 0.16}
tensor([[-2.8594, -2.0625, 0.4160, 1.8672, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:59:54,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.91 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.5312, -2.9531, -1.1016, 1.0703, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:3')
tensor([[-3.1562, -2.2812, 0.3398, 1.5781, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.2500, -3.3281, -0.3809, 0.9570, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.2500, -3.0312, 0.0234, -1.3438, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.6562, -1.9609, 2.1562, -0.9570, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.5000, -3.9219, -1.8047, 0.7305, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.9062, -1.7891, 1.2422, 1.3828, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 17:59:54,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.21 | optimizer_step: 0.20
[2025-11-06 17:59:54,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 67.29 | bwd_microstep: 157.91 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 156.88 | step_microstep: 1.94
[2025-11-06 17:59:54,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 210.23 | bwd: 158.96 | bwd_inner: 1.90 | bwd_allreduce: 156.92 | step: 2.03
16%|█▋ | 570/3507 [15:08<56:39, 1.16s/it] {'loss': 0.7185, 'learning_rate': 1.9095447640113104e-05, 'epoch': 0.16}
tensor([[-3.9844, -3.2188, -0.5312, 1.5469, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 17:59:55,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 116.03 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10
tensor([[-3.0781, -1.6953, 1.8281, -0.0537, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.5469, -2.9219, -0.7617, 1.6875, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.6094, -1.2188, 1.9766, -0.3418, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.4688, -4.4688, -1.2109, -0.0742, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-1.8047, -1.0625, 0.9922, 2.5312, -1.1172]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.7812, -3.4375, 0.5273, -0.0138, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.2969, -1.0469, 1.8125, 0.3965, -1.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 17:59:57,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.19 | optimizer_step: 0.23
[2025-11-06 17:59:57,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.46 | bwd_microstep: 2106.80 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 2105.82 | step_microstep: 2.32
[2025-11-06 17:59:57,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.50 | bwd: 2107.74 | bwd_inner: 1.72 | bwd_allreduce: 2105.88 | step: 2.43
16%|█▋ | 571/3507 [15:11<1:15:28, 1.54s/it] {'loss': 0.239, 'learning_rate': 1.909160469917627e-05, 'epoch': 0.16}
tensor([[-3.9062, -2.6250, 1.1641, 0.7422, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.0625, -2.6406, 1.3438, 0.7344, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 17:59:57,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.76 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-2.9219, -1.3672, 2.4219, 0.0227, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.4062, -2.4062, 0.5898, 1.6797, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.8906, -2.0469, 0.4570, 1.6562, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.3438, -2.5000, 0.0369, 1.1719, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.1250, -2.6562, 1.3906, -0.2295, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.8125, -2.2812, 1.5859, -0.7617, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 17:59:57,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.96 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 17:59:57,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.42 | bwd_microstep: 82.05 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 80.94 | step_microstep: 2.75
[2025-11-06 17:59:57,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 413.19 | bwd: 82.97 | bwd_inner: 1.82 | bwd_allreduce: 80.98 | step: 2.83
16%|█▋ | 572/3507 [15:11<1:00:44, 1.24s/it] {'loss': 0.2977, 'learning_rate': 1.9087754000646362e-05, 'epoch': 0.16}
tensor([[-2.6094, -1.2812, 1.9062, 0.7969, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.6250, -0.9883, 2.9688, -0.1187, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 17:59:58,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.12 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10
tensor([[-3.7031, -2.7500, 0.2031, 0.9492, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.5938, -3.3594, 0.3164, 0.5039, -3.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.1250, -2.2969, 0.2578, 1.1953, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.3438, -3.0781, 0.6328, 0.3574, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.7500, -2.2500, 1.8828, 0.1396, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.8125, -2.0000, 0.3828, 1.1875, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:00:00,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.83 | optimizer_gradients: 0.18 | optimizer_step: 0.24
[2025-11-06 18:00:00,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.44 | bwd_microstep: 2267.74 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 2266.54 | step_microstep: 2.93
[2025-11-06 18:00:00,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 440.60 | bwd: 2268.65 | bwd_inner: 1.94 | bwd_allreduce: 2266.59 | step: 3.03
16%|█▋ | 573/3507 [15:14<1:22:54, 1.70s/it] {'loss': 0.5998, 'learning_rate': 1.908389554780906e-05, 'epoch': 0.16}
tensor([[-3.2969, -2.2031, 0.9766, 1.5312, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:00:00,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.27 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.4844, -0.9844, 2.2500, -0.3887, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.4688, -2.6250, 0.0781, 1.7422, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.1562, -1.0312, 1.9766, 1.1641, -1.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-1.7109, -0.1680, 3.0625, 0.0425, -1.3672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-2.4844, -1.9922, -0.4102, 1.4453, -1.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.5000, -4.0625, 0.2100, -0.4512, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.2812, -2.7812, 1.2266, -0.2559, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:00:01,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 18:00:01,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.44 | bwd_microstep: 59.83 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 58.80 | step_microstep: 1.62
[2025-11-06 18:00:01,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 432.73 | bwd: 60.77 | bwd_inner: 1.80 | bwd_allreduce: 58.84 | step: 1.71
16%|█▋ | 574/3507 [15:15<1:05:50, 1.35s/it] {'loss': 0.8334, 'learning_rate': 1.908002934395667e-05, 'epoch': 0.16}
tensor([[-2.9219, -2.3281, -0.4551, 1.4297, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:3')
tensor([[-5.0000, -4.1250, -1.1250, 0.8594, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.5156, -2.4688, 0.5859, 1.7891, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.6719, -2.5938, 0.5898, 1.0781, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.8906, -2.9219, 0.0815, 1.3984, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:00:01,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.36 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.3438, -2.3438, 0.5703, 1.0859, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.6094, -1.8281, 0.3828, 1.2188, -1.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-2.7188, -1.4375, 1.8438, 0.6055, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 18:00:02,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.18 | optimizer_gradients: 0.19 | optimizer_step: 0.17
[2025-11-06 18:00:02,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.57 | bwd_microstep: 80.08 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 78.95 | step_microstep: 2.76
[2025-11-06 18:00:02,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 468.91 | bwd: 81.06 | bwd_inner: 1.90 | bwd_allreduce: 78.99 | step: 2.85
16%|█▋ | 575/3507 [15:16<1:01:44, 1.26s/it] {'loss': 0.772, 'learning_rate': 1.90761553923881e-05, 'epoch': 0.16}
tensor([[-3.1875, -2.6875, -0.8750, 1.7422, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[0.7422, 1.6016, 3.5312, 4.1250, 1.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[1.2812, 2.2344, 4.2812, 4.2812, 1.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:00:02,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.03 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.7188, -3.3906, 0.1416, -0.6641, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.2500, -1.9141, 1.7031, 1.1641, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.7344, -1.6094, 1.1250, 0.1963, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.4219, -2.4219, 0.4570, 1.1250, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.1250, -3.4219, -0.9375, 1.1250, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:00:02,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:00:02,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.16 | bwd_microstep: 209.63 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 208.44 | step_microstep: 1.67
[2025-11-06 18:00:02,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.22 | bwd: 210.53 | bwd_inner: 1.93 | bwd_allreduce: 208.48 | step: 1.75
16%|█▋ | 576/3507 [15:16<51:35, 1.06s/it] {'loss': 0.421, 'learning_rate': 1.9072273696408886e-05, 'epoch': 0.16}
tensor([[-3.4062, -2.8281, -0.7383, 1.8828, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.0625, -2.9531, 0.3027, 0.6523, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:00:03,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.98 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.3438e+00, -3.3125e+00, 3.5858e-04, 1.1406e+00, -3.3594e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.5000, -3.3906, 0.0923, 0.9219, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-0.6680, 0.2773, 2.7188, 2.7500, -0.2422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.6875, -2.1406, 1.6406, -0.4336, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-0.8906, 0.0251, 2.3750, 3.0000, -0.3770]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.0625, -2.1250, 0.8086, 1.9922, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:00:05,603] [INFO] [logging.py:128:log_dist]
[Rank 0] time (ms) | optimizer_allgather: 1.39 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 18:00:05,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.59 | bwd_microstep: 2390.58 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 2389.31 | step_microstep: 4.01 [2025-11-06 18:00:05,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.59 | bwd: 2391.55 | bwd_inner: 2.05 | bwd_allreduce: 2389.36 | step: 4.10 16%|█▋ | 577/3507 [15:19<1:16:36, 1.57s/it] {'loss': 0.5226, 'learning_rate': 1.9068384259331156e-05, 'epoch': 0.16} 16%|█▋ | 577/3507 [15:19<1:16:36, 1.57s/it]tensor([[-2.7812, -1.7734, 0.9492, 1.4375, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7812, -1.9141, 0.6250, 2.1719, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:00:05,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.09 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-0.6719, 0.5156, 3.4375, 2.5000, -0.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8281, -1.1875, 2.7656, 0.0693, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5312, -4.3125, -0.6484, -0.6875, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.7891, -0.2471, 2.9688, -0.2207, -1.4609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.8594, -3.0469, -0.4023, 1.6719, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4375, -3.2344, 0.4004, 0.4570, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:00:06,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.66 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:00:06,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.18 | bwd_microstep: 77.66 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 76.67 | step_microstep: 2.15 [2025-11-06 18:00:06,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.29 | bwd: 78.51 | bwd_inner: 1.68 | bwd_allreduce: 76.71 | step: 2.23 16%|█▋ | 578/3507 [15:19<1:00:17, 1.24s/it] {'loss': 0.8788, 'learning_rate': 1.9064487084473652e-05, 'epoch': 0.16} 16%|█▋ | 578/3507 [15:19<1:00:17, 1.24s/it]tensor([[-3.9688, -3.5000, -1.6875, 0.8867, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6406, -3.1562, -1.2969, 1.6172, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1875, -3.0781, 0.2617, 0.8945, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:00:06,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.11 | bwd_microstep: 1.36 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.8438, -1.5703, 1.7344, 1.2891, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0156, -1.6484, 1.8203, 0.6484, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5156, -1.9766, -0.1758, 2.2969, -1.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0312, -3.2812, -0.7734, 1.3359, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8594, -1.3203, 2.0312, -0.7422, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:00:08,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | 
optimizer_gradients: 0.27 | optimizer_step: 0.33 [2025-11-06 18:00:08,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.57 | bwd_microstep: 1495.56 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1494.25 | step_microstep: 2.86 [2025-11-06 18:00:08,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 450.71 | bwd: 1496.93 | bwd_inner: 2.44 | bwd_allreduce: 1494.32 | step: 2.95 17%|█▋ | 579/3507 [15:21<1:11:22, 1.46s/it] {'loss': 0.3787, 'learning_rate': 1.9060582175161713e-05, 'epoch': 0.17} 17%|█▋ | 579/3507 [15:21<1:11:22, 1.46s/it]tensor([[-6.2188, -5.0000, -1.2422, -0.5547, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:00:08,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.87 | bwd_microstep: 2.00 | bwd_inner_microstep: 1.81 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 tensor([[-3.2656, -2.7969, -1.0156, 1.5156, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -3.6875, -1.3359, 1.4062, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6406, -2.1406, 1.7969, -0.0854, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2812, -3.5781, -1.1172, 0.9922, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7969, -2.3438, 1.5703, 0.3301, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3125, -3.9688, -0.0786, -0.5156, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9844, -0.5586, 3.0781, 1.3203, -1.4766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:00:08,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.40 | 
optimizer_step: 0.60 [2025-11-06 18:00:08,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.63 | bwd_microstep: 233.53 | bwd_inner_microstep: 4.40 | bwd_allreduce_microstep: 228.91 | step_microstep: 4.24 [2025-11-06 18:00:08,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 400.55 | bwd: 235.53 | bwd_inner: 6.26 | bwd_allreduce: 228.99 | step: 4.36 17%|█▋ | 580/3507 [15:22<59:59, 1.23s/it] {'loss': 0.3815, 'learning_rate': 1.9056669534727287e-05, 'epoch': 0.17} 17%|█▋ | 580/3507 [15:22<59:59, 1.23s/it]tensor([[-4.0938, -3.2188, -0.4258, 1.0078, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0938, -3.1406, -0.2246, 0.7930, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:00:08,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 209.59 | bwd_microstep: 2.14 | bwd_inner_microstep: 1.87 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.20 tensor([[-2.7031, -1.2656, 2.3750, 0.8086, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3750, -3.1250, 0.6562, 0.5273, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6875, -2.6250, 0.5078, 1.1953, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -3.3750, -0.8789, 1.6953, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3750, -3.9219, 0.3320, -0.4629, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.9648, 0.5312, 3.6094, 0.4336, -0.7383]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:00:10,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.29 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:00:10,266] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.62 | bwd_microstep: 1067.28 | bwd_inner_microstep: 2.60 | bwd_allreduce_microstep: 1064.51 | step_microstep: 3.68 [2025-11-06 18:00:10,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.22 | bwd: 1069.40 | bwd_inner: 4.52 | bwd_allreduce: 1064.60 | step: 3.88 17%|█▋ | 581/3507 [15:24<1:04:18, 1.32s/it] {'loss': 0.4909, 'learning_rate': 1.905274916650891e-05, 'epoch': 0.17} 17%|█▋ | 581/3507 [15:24<1:04:18, 1.32s/it]tensor([[-4.9688, -3.8906, -0.4844, 0.6523, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8594, -1.9297, 0.7188, 1.7734, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3750, -3.1406, 0.5352, 0.8516, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([[-1.5938, -0.1934, 3.2656, 1.7266, -1.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([3], device='cuda:0') [2025-11-06 18:00:10,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.35 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-3.2812, -1.7266, 2.2188, -0.2256, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1094, -1.3984, 2.6719, -0.8008, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6719, -2.6875, 0.2617, 1.1016, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7969, -1.3438, 2.3281, 0.3066, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:00:10,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 18:00:10,825] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 174.05 | bwd_microstep: 102.17 | bwd_inner_microstep: 1.45 | bwd_allreduce_microstep: 100.62 | step_microstep: 2.52 [2025-11-06 18:00:10,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 409.42 | bwd: 103.13 | bwd_inner: 2.30 | bwd_allreduce: 100.67 | step: 2.63 17%|█▋ | 582/3507 [15:24<53:14, 1.09s/it] {'loss': 0.5677, 'learning_rate': 1.904882107385171e-05, 'epoch': 0.17} 17%|█▋ | 582/3507 [15:24<53:14, 1.09s/it]tensor([[-2.8906, -2.0781, 0.2363, 1.8203, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0938, -1.5078, 2.3281, -0.1001, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.9062, -2.1562, 0.2188, 2.0312, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:00:11,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.12 | bwd_microstep: 1.84 | bwd_inner_microstep: 1.59 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.09 tensor([[-5.4062, -4.4375, -1.1094, 1.0234, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6406, -2.8125, -0.2480, 1.1328, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.3125, -2.9688, 1.0078, 0.7773, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0000, -1.4688, 2.2500, 0.2500, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2812, -2.1406, 0.9648, 0.9688, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:00:13,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:00:13,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.40 | 
bwd_microstep: 1988.35 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1987.28 | step_microstep: 1.82 [2025-11-06 18:00:13,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.58 | bwd: 1990.19 | bwd_inner: 2.62 | bwd_allreduce: 1987.37 | step: 1.91 17%|█▋ | 583/3507 [15:27<1:12:28, 1.49s/it] {'loss': 0.3018, 'learning_rate': 1.9044885260107416e-05, 'epoch': 0.17} 17%|█▋ | 583/3507 [15:27<1:12:28, 1.49s/it]tensor([[-5.5312, -4.0625, -0.1875, -1.3281, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:00:13,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.10 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.6719, -1.9844, 2.1562, -0.8945, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0156, -1.8906, 1.1016, 0.7930, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4688, -3.1875, 0.6289, 0.5000, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -3.1719, -0.0835, 0.9258, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3906, -2.7031, -0.3730, 1.7734, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.9844, -1.7266, 1.4688, -0.0337, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6875, -4.1875, 0.3027, -0.0986, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:00:13,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.72 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:00:13,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.93 | bwd_microstep: 167.93 | 
bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 166.76 | step_microstep: 2.69 [2025-11-06 18:00:13,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.05 | bwd: 168.93 | bwd_inner: 1.99 | bwd_allreduce: 166.79 | step: 2.76 17%|█▋ | 584/3507 [15:27<58:35, 1.20s/it] {'loss': 0.4001, 'learning_rate': 1.9040941728634338e-05, 'epoch': 0.17} 17%|█▋ | 584/3507 [15:27<58:35, 1.20s/it]tensor([[-2.5156, -1.1328, 2.0781, 0.1045, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5000, -2.6094, 0.1895, 1.4922, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3125, -2.7031, -0.5742, 1.9219, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:00:13,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.38 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.0625, -2.6562, 1.3203, 0.3008, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7344, -2.9219, -0.3086, 1.3750, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2812, -3.2656, -0.1768, 0.4766, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -4.0000, -1.3438, 0.5703, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0938, -4.1250, -0.8906, 0.2383, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:00:15,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:00:15,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.85 | bwd_microstep: 1773.55 | bwd_inner_microstep: 1.04 | 
bwd_allreduce_microstep: 1772.43 | step_microstep: 1.85 [2025-11-06 18:00:15,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.26 | bwd: 1774.37 | bwd_inner: 1.79 | bwd_allreduce: 1772.47 | step: 1.92 17%|█▋ | 585/3507 [15:29<1:12:17, 1.48s/it] {'loss': 0.3793, 'learning_rate': 1.903699048279737e-05, 'epoch': 0.17} 17%|█▋ | 585/3507 [15:29<1:12:17, 1.48s/it]tensor([[-5.1250, -4.0625, -0.5156, 1.1406, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.6562, -0.9062, 1.2578, 2.8438, -0.9883]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:00:16,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.66 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.8125, -2.4219, 1.4688, 0.8789, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.3594, -0.7656, 3.0156, 0.6953, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3594, -1.8516, 1.9844, 0.2109, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6250, -3.9219, -1.3359, 1.2969, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.2500, -1.6406, 2.1562, -0.3223, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8906, -3.0000, -0.1729, 1.1953, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:00:16,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.29 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:00:16,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.43 | bwd_microstep: 422.85 | bwd_inner_microstep: 1.44 | bwd_allreduce_microstep: 421.32 | 
step_microstep: 3.23 [2025-11-06 18:00:16,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.11 | bwd: 423.81 | bwd_inner: 2.32 | bwd_allreduce: 421.36 | step: 3.32 17%|█▋ | 586/3507 [15:30<1:02:36, 1.29s/it] {'loss': 0.2059, 'learning_rate': 1.9033031525967992e-05, 'epoch': 0.17} 17%|█▋ | 586/3507 [15:30<1:02:36, 1.29s/it]tensor([[-4.0625, -3.2031, -0.4355, 1.5859, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.2188, -0.8242, 2.2812, 0.6680, -1.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:00:16,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.53 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.5156, -1.2344, 2.2188, 1.4297, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.9531, -1.6484, 1.7266, 1.3281, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4688, -2.1875, 1.4453, 0.9531, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0625, -2.2031, 0.4805, 2.3125, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.4688, -2.0312, 1.7188, -0.2256, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0312, -1.8438, 1.0703, 0.3848, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:00:18,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.31 | optimizer_step: 0.43 [2025-11-06 18:00:18,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.60 | bwd_microstep: 1210.65 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 1209.59 | step_microstep: 3.65 [2025-11-06 
18:00:18,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.16 | bwd: 1211.53 | bwd_inner: 1.68 | bwd_allreduce: 1209.67 | step: 3.72 17%|█▋ | 587/3507 [15:32<1:07:57, 1.40s/it] {'loss': 0.3337, 'learning_rate': 1.9029064861524267e-05, 'epoch': 0.17} 17%|█▋ | 587/3507 [15:32<1:07:57, 1.40s/it]tensor([[-2.9062, -1.5000, 2.3281, 1.3125, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.3906, -1.2031, 1.7656, 1.2500, -1.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4688, -1.0312, 2.2812, 0.2578, -1.9609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8438, -3.6562, -0.2754, 0.1406, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:00:18,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.63 | bwd_microstep: 1.81 | bwd_inner_microstep: 1.59 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.18 tensor([[-4.1875, -3.0938, 0.3711, 1.1328, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6094, -1.9766, 0.1006, 2.4688, -1.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1250, -2.6094, -0.9219, 1.2656, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-2.7031, -2.0781, -0.0684, 2.1094, -1.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:00:19,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.33 | optimizer_step: 0.35 [2025-11-06 18:00:19,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.81 | bwd_microstep: 422.78 | bwd_inner_microstep: 2.47 | bwd_allreduce_microstep: 420.11 | step_microstep: 3.24 [2025-11-06 18:00:19,385] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.47 | bwd: 424.59 | bwd_inner: 4.11 | bwd_allreduce: 420.20 | step: 3.42 17%|█▋ | 588/3507 [15:33<1:02:00, 1.27s/it] {'loss': 0.7891, 'learning_rate': 1.902509049285083e-05, 'epoch': 0.17} 17%|█▋ | 588/3507 [15:33<1:02:00, 1.27s/it]tensor([[-3.1250, -1.6875, 1.7891, -0.0747, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5625, -2.5938, 0.3359, 1.6562, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6250, -2.3438, 1.0156, 0.0245, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0625, -2.0938, 0.7148, 1.5156, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6094, -2.7344, 0.0388, 1.5156, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:00:20,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.49 | bwd_microstep: 1.27 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15 tensor([[-3.9844, -3.0938, -0.2451, 0.9922, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.7031, -2.1719, -0.3516, 2.2656, -1.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6406, -1.8438, 0.5352, 2.0469, -1.8672]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:00:21,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:00:21,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.71 | bwd_microstep: 480.74 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 479.53 | step_microstep: 1.69 [2025-11-06 18:00:21,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd: 396.20 | bwd: 482.00 | bwd_inner: 2.18 | bwd_allreduce: 479.60 | step: 1.86 17%|█▋ | 589/3507 [15:35<1:09:44, 1.43s/it] {'loss': 0.4841, 'learning_rate': 1.9021108423338886e-05, 'epoch': 0.17} 17%|█▋ | 589/3507 [15:35<1:09:44, 1.43s/it]tensor([[-3.3750, -2.6719, -0.3047, 1.8359, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0625, -3.4688, -1.2969, 0.8242, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9062, -1.2891, 2.5625, -0.7266, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6719, -2.4062, 1.0859, 0.9141, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:00:21,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.28 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.5938, -3.5000, -0.3223, -0.1348, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.1016, 0.0869, 2.7500, 1.9297, -0.6680]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.5781, -0.9688, 2.3750, -0.3359, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.4688, -1.7188, 0.5078, 2.3594, -1.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:00:21,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:00:21,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.78 | bwd_microstep: 2.49 | bwd_inner_microstep: 1.56 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.16 [2025-11-06 18:00:21,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.07 | bwd: 3.35 | bwd_inner: 2.33 | 
[per-rank debug output interleaved at every step below: each of cuda:0–3 prints a 1x5 bfloat16 logit tensor (grad_fn name stripped in capture) followed by an integer label tensor; these repeats are elided for readability]

bwd_allreduce: 0.89 | step: 2.25
 17%|█▋ | 590/3507 [15:35<58:32, 1.20s/it] {'loss': 0.7152, 'learning_rate': 1.901711865638622e-05, 'epoch': 0.17}
[2025-11-06 18:00:24,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.55 | bwd: 927.56 | bwd_inner: 1.86 | bwd_allreduce: 925.59 | step: 2.13
 17%|█▋ | 591/3507 [15:38<1:18:35, 1.62s/it] {'loss': 0.4467, 'learning_rate': 1.9013121195397175e-05, 'epoch': 0.17}
[2025-11-06 18:00:24,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.15 | bwd: 2.76 | bwd_inner: 1.89 | bwd_allreduce: 0.74 | step: 1.81
 17%|█▋ | 592/3507 [15:38<1:01:22, 1.26s/it] {'loss': 0.8743, 'learning_rate': 1.900911604378267e-05, 'epoch': 0.17}
[2025-11-06 18:00:25,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.35 | bwd: 195.86 | bwd_inner: 1.80 | bwd_allreduce: 193.94 | step: 1.62
 17%|█▋ | 593/3507 [15:39<50:54, 1.05s/it] {'loss': 0.2305, 'learning_rate': 1.9005103204960174e-05, 'epoch': 0.17}
[2025-11-06 18:00:27,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.03 | bwd: 2.91 | bwd_inner: 1.92 | bwd_allreduce: 0.87 | step: 2.16
 17%|█▋ | 594/3507 [15:41<1:02:03, 1.28s/it] {'loss': 0.7198, 'learning_rate': 1.900108268235373e-05, 'epoch': 0.17}
[2025-11-06 18:00:27,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 414.99 | bwd: 2.93 | bwd_inner: 1.74 | bwd_allreduce: 1.04 | step: 2.13
 17%|█▋ | 595/3507 [15:41<53:50, 1.11s/it] {'loss': 0.2317, 'learning_rate': 1.8997054479393925e-05, 'epoch': 0.17}
[2025-11-06 18:00:28,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.30 | bwd: 341.10 | bwd_inner: 1.86 | bwd_allreduce: 339.09 | step: 1.98
 17%|█▋ | 596/3507 [15:42<50:42, 1.05s/it] {'loss': 0.241, 'learning_rate': 1.8993018599517897e-05, 'epoch': 0.17}
[2025-11-06 18:00:29,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.31 | bwd: 2.96 | bwd_inner: 1.93 | bwd_allreduce: 0.91 | step: 1.93
 17%|█▋ | 597/3507 [15:43<49:00, 1.01s/it] {'loss': 0.4935, 'learning_rate': 1.8988975046169352e-05, 'epoch': 0.17}
[2025-11-06 18:00:32,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.51 | bwd: 300.86 | bwd_inner: 1.99 | bwd_allreduce: 298.74 | step: 1.96
 17%|█▋ | 598/3507 [15:45<1:07:47, 1.40s/it] {'loss': 0.403, 'learning_rate': 1.898492382279853e-05, 'epoch': 0.17}
[2025-11-06 18:00:32,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.26 | bwd: 84.95 | bwd_inner: 2.72 | bwd_allreduce: 82.10 | step: 1.59
 17%|█▋ | 599/3507 [15:46<57:23, 1.18s/it] {'loss': 0.4838, 'learning_rate': 1.8980864932862214e-05, 'epoch': 0.17}
[2025-11-06 18:00:34,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 492.48 | bwd: 386.13 | bwd_inner: 1.95 | bwd_allreduce: 383.98 | step: 2.10
 17%|█▋ | 600/3507 [15:47<59:43, 1.23s/it] {'loss': 0.275, 'learning_rate': 1.897679837982373e-05, 'epoch': 0.17}
[2025-11-06 18:00:35,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.60 | bwd: 979.23 | bwd_inner: 2.09 | bwd_allreduce: 977.00 | step: 2.10
 17%|█▋ | 601/3507 [15:49<1:08:34, 1.42s/it] {'loss': 0.3377, 'learning_rate': 1.8972724167152958e-05, 'epoch': 0.17}
[2025-11-06 18:00:36,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 411.41 | bwd: 17.93 | bwd_inner: 1.89 | bwd_allreduce: 15.91 | step: 1.81
 17%|█▋ | 602/3507 [15:50<54:46, 1.13s/it] {'loss': 0.2826, 'learning_rate': 1.896864229832629e-05, 'epoch': 0.17}
[2025-11-06 18:00:38,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 260.45 | bwd: 939.98 | bwd_inner: 1.63 | bwd_allreduce: 938.21 | step: 1.96
 17%|█▋ | 603/3507 [15:52<1:07:04, 1.39s/it] {'loss': 0.9433, 'learning_rate': 1.8964552776826662e-05, 'epoch': 0.17}
[2025-11-06 18:00:39,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.45 | bwd: 772.07 | bwd_inner: 1.80 | bwd_allreduce: 770.16 | step: 1.73
 17%|█▋ | 604/3507 [15:53<1:03:50, 1.32s/it] {'loss': 0.3991, 'learning_rate': 1.896045560614355e-05, 'epoch': 0.17}
[2025-11-06 18:00:41,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.17 | bwd: 1994.05 | bwd_inner: 2.03 | bwd_allreduce: 1991.88 | step: 1.63
 17%|█▋ | 605/3507 [15:55<1:18:37, 1.63s/it] {'loss': 0.2302, 'learning_rate': 1.8956350789772937e-05, 'epoch': 0.17}
[2025-11-06 18:00:42,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 426.29 | bwd: 184.93 | bwd_inner: 2.30 | bwd_allreduce: 182.51 | step: 1.68
 17%|█▋ | 606/3507 [15:56<1:04:24, 1.33s/it] {'loss': 0.2672, 'learning_rate': 1.895223833121734e-05, 'epoch': 0.17}
[2025-11-06 18:00:43,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 424.04 | bwd: 3.04 | bwd_inner: 2.02 | bwd_allreduce: 0.89 | step: 1.80
 17%|█▋ | 607/3507 [15:56<51:51, 1.07s/it] {'loss': 0.2455, 'learning_rate': 1.8948118233985803e-05, 'epoch': 0.17}
[2025-11-06 18:00:44,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 310.23 | bwd: 786.99 | bwd_inner: 1.75 | bwd_allreduce: 785.11 | step: 1.81
 17%|█▋ | 608/3507 [15:58<56:04, 1.16s/it] {'loss': 0.6898, 'learning_rate': 1.8943990501593873e-05, 'epoch': 0.17}
[2025-11-06 18:00:45,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.67 | bwd: 943.82 | bwd_inner: 2.06 | bwd_allreduce: 941.64 | step: 161.76
 17%|█▋ | 609/3507 [15:59<1:00:10, 1.25s/it] {'loss': 0.3153, 'learning_rate': 1.8939855137563627e-05, 'epoch': 0.17}
[2025-11-06 18:00:48,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.43 | bwd: 1862.50 | bwd_inner: 2.22 | bwd_allreduce: 1860.16 | step: 2.19
 17%|█▋ | 610/3507 [16:01<1:14:30, 1.54s/it] {'loss': 2.3079, 'learning_rate': 1.8935712145423643e-05, 'epoch': 0.17}
[2025-11-06 18:00:48,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 545.20 | bwd: 2.98 | bwd_inner: 2.12 | bwd_allreduce: 0.74 | step: 1.53
 17%|█▋ | 611/3507 [16:02<1:00:42, 1.26s/it] {'loss': 0.571, 'learning_rate': 1.893156152870901e-05, 'epoch': 0.17}
tensor([3], device='cuda:0') tensor([[-5.6875, -4.7812, -1.5703, 0.3945, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.7266, 0.6797, 4.0000, 2.1719, -0.3965]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.1875, -0.8281, 2.1719, 0.3301, -1.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9375, -3.7344, 0.0859, 0.8438, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:00:50,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.68 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:00:50,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.56 | bwd_microstep: 1306.80 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1305.66 | step_microstep: 3.01 [2025-11-06 18:00:50,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.44 | bwd: 1307.76 | bwd_inner: 1.95 | bwd_allreduce: 1305.69 | step: 3.09 17%|█▋ | 612/3507 [16:04<1:06:28, 1.38s/it] {'loss': 0.189, 'learning_rate': 1.892740329096133e-05, 'epoch': 0.17} 17%|█▋ | 612/3507 [16:04<1:06:28, 1.38s/it]tensor([[-5.5000, -4.0938, 0.1108, 0.1182, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2812, -3.0781, 0.4805, 0.7188, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7656, -2.9062, -0.0874, 1.4609, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:00:50,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.40 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.5312, -2.3594, 0.9414, 0.7109, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') 
tensor([[-4.8750, -3.8750, -0.6172, 0.7070, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1250, -3.1406, -0.0688, 0.6445, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0312, -2.5625, 1.4375, 0.5000, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2812, -2.1719, 1.1250, 1.8750, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:00:50,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:00:50,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.85 | bwd_microstep: 174.39 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 173.14 | step_microstep: 1.59 [2025-11-06 18:00:50,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.28 | bwd: 175.36 | bwd_inner: 2.00 | bwd_allreduce: 173.20 | step: 1.69 17%|█▋ | 613/3507 [16:04<55:00, 1.14s/it] {'loss': 0.5747, 'learning_rate': 1.89232374357287e-05, 'epoch': 0.17} 17%|█▋ | 613/3507 [16:04<55:00, 1.14s/it]tensor([[-2.9219, -1.5156, 1.7500, -0.2070, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:00:51,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 116.65 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.6875, -3.0156, -0.6055, 1.9297, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6562, -1.9844, 2.3438, -0.4707, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7969, -1.6953, 1.4062, 1.9453, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.3438, -2.6719, -0.4590, 1.5859, 
-2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1562, -2.6094, 1.7188, 0.0933, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2344, -2.5469, -0.2295, 2.3438, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -2.8125, 0.6953, 0.8672, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:00:53,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:00:53,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.60 | bwd_microstep: 1968.37 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 1967.39 | step_microstep: 2.54 [2025-11-06 18:00:53,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 291.27 | bwd: 1969.31 | bwd_inner: 1.74 | bwd_allreduce: 1967.44 | step: 2.63 18%|█▊ | 614/3507 [16:07<1:11:39, 1.49s/it] {'loss': 0.2634, 'learning_rate': 1.8919063966565717e-05, 'epoch': 0.18} 18%|█▊ | 614/3507 [16:07<1:11:39, 1.49s/it]tensor([[-3.3281, -1.9062, 1.6875, 0.2852, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8438, -3.7656, -0.2432, 0.6641, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5938, -2.2031, 1.2734, -0.1006, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8750, -2.1250, 0.1973, 1.8750, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:00:53,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.56 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 [h264 @ 0xc58f8c0] SEI type 0 size 64 truncated at 56 [h264 @ 0x9184240] SEI type 0 
size 64 truncated at 56 [h264 @ 0x9184240] SEI type 0 size 64 truncated at 56 [h264 @ 0x9184240] SEI type 0 size 64 truncated at 56 tensor([[-4.2188, -3.1094, 0.2773, 0.9570, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5625, -3.1250, -1.5312, 0.9805, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.8750, -3.0938, 1.5703, -1.1328, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.9531, -1.8125, 1.2578, 1.3672, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:00:54,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.21 | optimizer_step: 0.30 [2025-11-06 18:00:54,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.40 | bwd_microstep: 1078.84 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 1077.49 | step_microstep: 2.16 [2025-11-06 18:00:54,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.96 | bwd: 1079.79 | bwd_inner: 2.11 | bwd_allreduce: 1077.54 | step: 2.25 18%|█▊ | 615/3507 [16:08<1:11:32, 1.48s/it] {'loss': 0.9557, 'learning_rate': 1.891488288703348e-05, 'epoch': 0.18} 18%|█▊ | 615/3507 [16:08<1:11:32, 1.48s/it]tensor([[-3.3750, -2.5938, -0.0762, 1.7812, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.1113, 1.0078, 3.3594, 2.0625, 0.0991]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.4922, -0.7617, 1.2812, 2.8281, -0.8477]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:00:54,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.41 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.2344, -2.5625, -0.2139, 2.1719, 
-2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1562, -3.1094, 0.1934, 1.2812, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1875, -3.2812, -0.2930, 1.4766, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1562, -2.7812, 1.1719, 0.2930, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5156, -2.9844, -1.0234, 1.5938, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:00:55,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:00:55,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.20 | bwd_microstep: 94.54 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 93.49 | step_microstep: 1.45 [2025-11-06 18:00:55,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.64 | bwd: 95.44 | bwd_inner: 1.79 | bwd_allreduce: 93.53 | step: 1.52 18%|█▊ | 616/3507 [16:08<57:13, 1.19s/it] {'loss': 0.5302, 'learning_rate': 1.891069420069957e-05, 'epoch': 0.18} 18%|█▊ | 616/3507 [16:09<57:13, 1.19s/it]tensor([[-3.0156, -1.9609, 1.1094, 1.9453, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3438, -1.1641, 1.6875, 0.7422, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7344, -1.1484, 2.3438, 0.4570, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2656, -2.7969, -1.0078, 1.9453, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:00:55,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.72 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | 
bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.2188, -2.9531, 0.5234, -0.3633, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9062, -2.2812, -0.0830, 2.4844, -1.9922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.6016, -0.2695, 3.0938, 1.6562, -1.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.6250, -1.3906, 1.7266, 0.6133, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:00:57,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.17 | optimizer_step: 0.21 [2025-11-06 18:00:57,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.28 | bwd_microstep: 1708.70 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 1707.46 | step_microstep: 2.06 [2025-11-06 18:00:57,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 512.03 | bwd: 1709.49 | bwd_inner: 1.86 | bwd_allreduce: 1707.51 | step: 2.14 18%|█▊ | 617/3507 [16:11<1:12:45, 1.51s/it] {'loss': 0.4465, 'learning_rate': 1.8906497911138082e-05, 'epoch': 0.18} 18%|█▊ | 617/3507 [16:11<1:12:45, 1.51s/it]tensor([[-5.3125, -3.9062, 0.4492, 0.3730, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0938, -3.5156, 1.0312, -0.4668, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:00:57,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.73 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.8438, -3.8750, -0.9023, -0.1826, -3.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.5391, -0.7305, 1.5234, 2.7031, -0.9180]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:2') tensor([[-2.2344, -1.0703, 1.6016, 0.8320, -1.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.9375, -1.6250, 1.6562, 0.8359, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4688, -2.2656, 1.1562, 0.8203, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9531, -3.0156, -0.0596, 1.1328, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:00:57,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:00:57,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.96 | bwd_microstep: 129.50 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 128.36 | step_microstep: 1.68 [2025-11-06 18:00:57,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 283.71 | bwd: 130.39 | bwd_inner: 1.86 | bwd_allreduce: 128.40 | step: 1.77 18%|█▊ | 618/3507 [16:11<57:22, 1.19s/it] {'loss': 0.4233, 'learning_rate': 1.8902294021929572e-05, 'epoch': 0.18} 18%|█▊ | 618/3507 [16:11<57:22, 1.19s/it]tensor([[-2.8125, -1.8203, 1.0391, 1.4375, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:00:58,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.79 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.1875, -3.5625, -1.1797, 1.9453, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6719, -1.2891, 1.8359, -0.1797, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -3.1250, 0.2217, 1.5703, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -4.1562, 
-0.1611, 0.2305, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9219, -2.3906, 1.5625, -0.0203, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3438, -1.6797, 2.5156, -0.4668, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.7188, -2.5781, 0.6016, 0.4199, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:01:00,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:01:00,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 234.97 | bwd_microstep: 2289.94 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2288.87 | step_microstep: 1.70 [2025-11-06 18:01:00,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.79 | bwd: 2290.84 | bwd_inner: 1.81 | bwd_allreduce: 2288.91 | step: 1.79 18%|█▊ | 619/3507 [16:14<1:19:27, 1.65s/it] {'loss': 0.4144, 'learning_rate': 1.8898082536661097e-05, 'epoch': 0.18} 18%|█▊ | 619/3507 [16:14<1:19:27, 1.65s/it]tensor([[-4.8438, -3.0469, 1.8125, -0.5430, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4375, -3.1719, 0.4863, 0.2490, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.3594, -2.4219, 0.5586, 1.9531, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:01:00,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.39 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.4375, -3.9062, 0.7891, -0.0071, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7500, -1.4688, 1.6484, 0.4414, -2.1562]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8594, -2.7500, 0.5977, 1.4375, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3750, -3.1406, 0.4648, 0.4434, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1250, -2.4375, 2.0156, -0.3145, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:01:01,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:01:01,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.19 | bwd_microstep: 1.91 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.81 | step_microstep: 1.96 [2025-11-06 18:01:01,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 505.61 | bwd: 2.85 | bwd_inner: 1.88 | bwd_allreduce: 0.84 | step: 2.04 18%|█▊ | 620/3507 [16:14<1:03:36, 1.32s/it] {'loss': 0.3876, 'learning_rate': 1.8893863458926185e-05, 'epoch': 0.18} 18%|█▊ | 620/3507 [16:14<1:03:36, 1.32s/it]tensor([[-0.5234, 1.0234, 4.4375, 1.4844, -0.3164]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9844, -2.0781, 0.8242, 2.5156, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:01:01,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.13 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.3906, -2.4688, 0.4160, 2.1719, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.4219, -1.7188, 0.5117, 2.2188, -1.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.0469, -1.3125, 2.3281, -1.2109, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([1], device='cuda:0') tensor([[-5.2188, -4.4375, -1.5156, 1.0000, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1719, -2.7031, -0.9648, 1.6406, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9688, -2.5156, 1.3438, 0.5039, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:01:03,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:01:03,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 70.67 | bwd_microstep: 1604.04 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1603.01 | step_microstep: 1.78 [2025-11-06 18:01:03,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 215.82 | bwd: 1604.98 | bwd_inner: 1.75 | bwd_allreduce: 1603.07 | step: 1.87 18%|█▊ | 621/3507 [16:16<1:11:11, 1.48s/it] {'loss': 0.9451, 'learning_rate': 1.888963679232486e-05, 'epoch': 0.18} 18%|█▊ | 621/3507 [16:16<1:11:11, 1.48s/it]tensor([[-2.8594, -1.5859, 1.9766, 1.4062, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:01:03,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.83 | bwd_microstep: 1.14 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-2.7188, -1.3984, 1.4844, -0.3887, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -3.2500, 0.1226, 0.7969, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7188, -2.3750, 1.5234, 0.9492, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -3.9375, -0.4805, 0.6484, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') 
tensor([[-3.6094, -1.9141, 2.5469, -0.1934, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0312, -2.3125, 2.0938, -0.4238, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1250, -2.9688, 0.6094, 0.9023, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:01:03,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:01:03,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.01 | bwd_microstep: 499.24 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 498.06 | step_microstep: 1.74 [2025-11-06 18:01:03,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.86 | bwd: 500.38 | bwd_inner: 2.14 | bwd_allreduce: 498.10 | step: 1.82 18%|█▊ | 622/3507 [16:17<1:01:42, 1.28s/it] {'loss': 0.4747, 'learning_rate': 1.8885402540463598e-05, 'epoch': 0.18} 18%|█▊ | 622/3507 [16:17<1:01:42, 1.28s/it]tensor([[-4.5625, -2.9844, 1.5078, 0.0859, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3125, -2.0625, 1.6250, 1.4219, -2.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:01:03,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.86 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0625, -2.8750, 0.6836, 1.3906, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8281, -3.2656, -1.1875, 1.5781, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9531, -2.6875, 1.1641, 1.3906, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -2.6562, 1.3984, 0.3066, 
-3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0625, -3.7188, 0.3770, 0.3145, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4688, -2.3750, 0.9766, 1.6172, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:01:05,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:01:05,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.58 | bwd_microstep: 1861.26 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1860.08 | step_microstep: 1.93 [2025-11-06 18:01:05,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 263.45 | bwd: 1862.11 | bwd_inner: 1.85 | bwd_allreduce: 1860.12 | step: 2.01 18%|█▊ | 623/3507 [16:19<1:14:17, 1.55s/it] {'loss': 0.5701, 'learning_rate': 1.8881160706955364e-05, 'epoch': 0.18} 18%|█▊ | 623/3507 [16:19<1:14:17, 1.55s/it]tensor([[-2.3750, -1.3594, 1.5156, 2.1719, -1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:01:06,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.93 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.16 tensor([[-5.6562, -4.2500, -0.0422, -0.3594, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9219, -2.7969, 0.5508, 1.0391, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -2.7500, 1.2109, 0.8750, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.3281, -1.6484, 0.5312, 2.2188, -1.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7188, -3.5625, -0.2188, -0.3555, -3.7812]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.8516, -0.3359, 3.0312, 0.8984, -1.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4844, -1.8203, 2.2969, -0.2930, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:01:06,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:01:06,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.95 | bwd_microstep: 137.14 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 136.18 | step_microstep: 1.55 [2025-11-06 18:01:06,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.91 | bwd: 138.03 | bwd_inner: 1.69 | bwd_allreduce: 136.22 | step: 1.70 18%|█▊ | 624/3507 [16:20<59:26, 1.24s/it] {'loss': 0.4028, 'learning_rate': 1.887691129541959e-05, 'epoch': 0.18} 18%|█▊ | 624/3507 [16:20<59:26, 1.24s/it]tensor([[-2.6250, -2.1719, -0.3555, 2.8594, -1.6797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:01:06,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 110.64 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-3.8594, -2.2969, 1.9531, 0.2012, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1562, -2.6094, -0.6406, 1.7500, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5625, -3.7031, -0.7383, 1.2812, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2500, -2.5000, 1.8984, -1.1094, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0000, -3.6250, 0.5898, 0.3926, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
18%|█▊ | 625/3507 [16:20<47:05, 1.02it/s] {'loss': 0.2614, 'learning_rate': 1.8872654309482163e-05, 'epoch': 0.18}
18%|█▊ | 626/3507 [16:21<50:00, 1.04s/it] {'loss': 0.6526, 'learning_rate': 1.8868389752775447e-05, 'epoch': 0.18}
18%|█▊ | 627/3507 [16:23<1:04:56, 1.35s/it] {'loss': 0.3917, 'learning_rate': 1.886411762893826e-05, 'epoch': 0.18}
18%|█▊ | 628/3507 [16:24<54:28, 1.14s/it] {'loss': 1.1537, 'learning_rate': 1.8859837941615878e-05, 'epoch': 0.18}
18%|█▊ | 629/3507 [16:26<59:39, 1.24s/it] {'loss': 1.2947, 'learning_rate': 1.8855550694460026e-05, 'epoch': 0.18}
18%|█▊ | 630/3507 [16:27<1:07:35, 1.41s/it] {'loss': 0.4467, 'learning_rate': 1.8851255891128883e-05, 'epoch': 0.18}
18%|█▊ | 631/3507 [16:29<1:08:01, 1.42s/it] {'loss': 1.0808, 'learning_rate': 1.8846953535287078e-05, 'epoch': 0.18}
18%|█▊ | 632/3507 [16:30<1:11:11, 1.49s/it] {'loss': 0.344, 'learning_rate': 1.884264363060568e-05, 'epoch': 0.18}
18%|█▊ | 633/3507 [16:32<1:06:16, 1.38s/it] {'loss': 0.2465, 'learning_rate': 1.8838326180762205e-05, 'epoch': 0.18}
18%|█▊ | 634/3507 [16:32<53:59, 1.13s/it] {'loss': 0.6399, 'learning_rate': 1.88340011894406e-05, 'epoch': 0.18}
18%|█▊ | 635/3507 [16:33<53:29, 1.12s/it] {'loss': 0.8007, 'learning_rate': 1.8829668660331252e-05, 'epoch': 0.18}
18%|█▊ | 636/3507 [16:35<1:08:10, 1.42s/it] {'loss': 0.263, 'learning_rate': 1.882532859713097e-05, 'epoch': 0.18}
18%|█▊ | 637/3507 [16:36<54:08, 1.13s/it] {'loss': 1.0797, 'learning_rate': 1.8820981003543013e-05, 'epoch': 0.18}
18%|█▊ | 638/3507 [16:38<1:08:32, 1.43s/it] {'loss': 0.3289, 'learning_rate': 1.8816625883277044e-05, 'epoch': 0.18}
18%|█▊ | 639/3507 [16:39<1:03:09, 1.32s/it] {'loss': 0.8176, 'learning_rate': 1.8812263240049152e-05, 'epoch': 0.18}
18%|█▊ | 640/3507 [16:40<1:02:29, 1.31s/it] {'loss': 0.2141, 'learning_rate': 1.8807893077581863e-05, 'epoch': 0.18}
18%|█▊ | 641/3507 [16:41<55:12, 1.16s/it] {'loss': 0.288, 'learning_rate': 1.880351539960409e-05, 'epoch': 0.18}
18%|█▊ | 642/3507 [16:44<1:22:54, 1.74s/it] {'loss': 0.5497, 'learning_rate': 1.8799130209851182e-05, 'epoch': 0.18}
18%|█▊ | 643/3507 [16:45<1:10:02, 1.47s/it] {'loss': 0.4019, 'learning_rate': 1.879473751206489e-05, 'epoch': 0.18}
18%|█▊ | 644/3507 [16:47<1:11:39, 1.50s/it] {'loss': 0.5692, 'learning_rate': 1.879033730999337e-05, 'epoch': 0.18}
18%|█▊ | 645/3507 [16:47<56:55, 1.19s/it] {'loss': 0.8259, 'learning_rate': 1.8785929607391184e-05, 'epoch': 0.18}
18%|█▊ | 646/3507 [16:49<1:06:54, 1.40s/it] {'loss': 
0.2216, 'learning_rate': 1.878151440801929e-05, 'epoch': 0.18} 18%|█▊ | 646/3507 [16:49<1:06:54, 1.40s/it]tensor([[-4.7812, -3.6719, -0.0859, 1.1406, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4531, -1.7969, 2.0156, -0.3730, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:01:35,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.44 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.7188, -2.0312, 2.2969, -0.2021, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8125, -3.1250, 1.5625, -0.1289, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5625, -2.5781, 0.2773, 0.9844, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7812, -3.0625, 1.5234, -0.6797, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2500, -2.1250, 0.8828, 0.6172, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5000, -2.5000, 0.5156, 1.6406, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:01:36,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 18:01:36,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.67 | bwd_microstep: 97.73 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 96.63 | step_microstep: 2.30 [2025-11-06 18:01:36,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.15 | bwd: 98.58 | bwd_inner: 1.76 | bwd_allreduce: 96.68 | step: 2.39 18%|█▊ | 647/3507 [16:49<52:59, 1.11s/it] {'loss': 0.304, 'learning_rate': 1.877709171564504e-05, 
'epoch': 0.18} 18%|█▊ | 647/3507 [16:49<52:59, 1.11s/it]tensor([[4.0938, 4.7188, 5.6562, 6.2812, 3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1406, -1.3281, 3.0156, -0.5039, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7812, -3.3438, 0.8242, 0.4492, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9375, -2.4375, -0.5391, 2.1094, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4375, -1.5469, 2.6562, -1.0938, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5781, -2.0781, 1.7500, 0.1040, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:01:38,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.80 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.2031, -1.4766, 2.3281, -0.7773, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.2344, -0.9492, 1.9141, 0.5859, -1.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:01:39,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.14 | optimizer_step: 0.14 [2025-11-06 18:01:39,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.37 | bwd_microstep: 389.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 389.04 | step_microstep: 1.75 [2025-11-06 18:01:39,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.17 | bwd: 390.77 | bwd_inner: 1.57 | bwd_allreduce: 389.08 | step: 1.82 18%|█▊ | 648/3507 [16:53<1:23:21, 1.75s/it] {'loss': 0.2951, 'learning_rate': 1.8772661534042195e-05, 'epoch': 0.18} 18%|█▊ | 648/3507 [16:53<1:23:21, 
1.75s/it]tensor([[-3.7031, -2.1406, 1.5000, -0.4727, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4062, -2.8125, 1.5000, 0.0237, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0625, -4.1562, -0.9688, 0.9570, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0625, -3.1719, -0.1699, 1.2266, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:01:39,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.35 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-3.4062, -1.5703, 2.6719, -0.8633, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.8125, -5.0625, -2.1875, 0.2070, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1562, -4.3750, -1.4609, 1.1094, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5000, -3.2656, 0.3770, 0.8125, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:01:39,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:01:39,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.60 | bwd_microstep: 43.70 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 42.46 | step_microstep: 1.85 [2025-11-06 18:01:39,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.97 | bwd: 44.62 | bwd_inner: 1.97 | bwd_allreduce: 42.51 | step: 1.95 19%|█▊ | 649/3507 [16:53<1:04:15, 1.35s/it] {'loss': 0.722, 'learning_rate': 1.8768223866990884e-05, 'epoch': 0.19} 19%|█▊ | 649/3507 [16:53<1:04:15, 1.35s/it]tensor([[-3.5000, -2.4062, 0.8359, 
1.6875, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2188, -2.8438, 1.1797, 0.8594, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0625, -2.4375, 1.8438, 0.6484, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:01:40,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.26 | bwd_microstep: 1.17 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 tensor([[-4.2812, -2.9688, 0.7773, 0.6367, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.7188, -1.0391, 2.8750, -0.0674, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.7344, -0.4160, 0.5781, 3.0781, -0.1162]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.7500, -5.2812, -0.6484, -0.5742, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5156, -2.0781, -0.4844, 2.0312, -1.6328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:01:41,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.22 | optimizer_step: 0.20 [2025-11-06 18:01:41,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.00 | bwd_microstep: 623.69 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 622.56 | step_microstep: 2.12 [2025-11-06 18:01:41,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 428.27 | bwd: 624.88 | bwd_inner: 2.04 | bwd_allreduce: 622.64 | step: 2.25 19%|█▊ | 650/3507 [16:55<1:05:55, 1.38s/it] {'loss': 0.4215, 'learning_rate': 1.8763778718277645e-05, 'epoch': 0.19} 19%|█▊ | 650/3507 [16:55<1:05:55, 1.38s/it]tensor([[-2.5469, -0.7031, 3.1562, -0.4434, -2.1406]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:01:41,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.70 | bwd_microstep: 1.63 | bwd_inner_microstep: 1.47 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.1562, -2.3594, 0.1660, 2.0469, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7500, -2.1719, 1.6562, -0.2812, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -2.8594, 0.8516, 0.0613, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5781, -1.2734, 1.8438, 0.3789, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.7500, -3.1094, 1.4062, -0.3457, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.4844, -0.2041, 2.7344, 1.0312, -1.1016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.5859, -0.2334, 3.0938, 2.4062, -1.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:01:41,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:01:41,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.46 | bwd_microstep: 56.24 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 54.98 | step_microstep: 1.69 [2025-11-06 18:01:41,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 284.16 | bwd: 57.87 | bwd_inner: 2.66 | bwd_allreduce: 55.04 | step: 1.79 19%|█▊ | 651/3507 [16:55<51:30, 1.08s/it] {'loss': 0.8502, 'learning_rate': 1.8759326091695385e-05, 'epoch': 0.19} 19%|█▊ | 651/3507 [16:55<51:30, 1.08s/it]tensor([[-6.5938, -5.4375, -1.4922, 0.3867, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:1') tensor([[-3.8125, -2.6719, 0.5078, 0.8945, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.0469, 0.4180, 3.4531, 1.7812, -0.6797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:01:42,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.99 | bwd_microstep: 1.12 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-5.9062, -5.2500, -2.6562, 0.0767, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.4688, -3.5156, 1.5078, -1.5703, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4062, -2.1562, 1.2266, 0.9453, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7031, -1.1484, 2.2500, -0.3477, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8750, -2.5312, 0.9375, -0.7070, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:01:43,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.15 | optimizer_step: 0.19 [2025-11-06 18:01:43,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.16 | bwd_microstep: 1142.30 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1141.19 | step_microstep: 1.65 [2025-11-06 18:01:43,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.15 | bwd: 1143.42 | bwd_inner: 2.05 | bwd_allreduce: 1141.24 | step: 1.75 19%|█▊ | 652/3507 [16:57<1:09:11, 1.45s/it] {'loss': 0.2514, 'learning_rate': 1.8754865991043402e-05, 'epoch': 0.19} 19%|█▊ | 652/3507 [16:57<1:09:11, 1.45s/it]tensor([[-4.2188, -2.7656, 1.2344, 0.6797, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6250, 
-4.4375, -0.7578, -0.2500, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:01:44,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.66 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.5625, -3.4688, -0.0322, 1.0156, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5938, -3.4219, 0.1475, 1.1484, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6250, -2.9688, -0.6367, 1.6641, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5469, -1.6484, 2.7969, -0.9961, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9531, -2.6406, 0.9102, 0.2148, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8281, -2.0156, 2.4844, -0.3926, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:01:44,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.15 | optimizer_step: 0.23 [2025-11-06 18:01:44,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 283.70 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.96 | step_microstep: 1.85 [2025-11-06 18:01:44,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 458.38 | bwd: 3.00 | bwd_inner: 1.88 | bwd_allreduce: 1.00 | step: 1.94 19%|█▊ | 653/3507 [16:58<55:34, 1.17s/it] {'loss': 0.3725, 'learning_rate': 1.8750398420127353e-05, 'epoch': 0.19} 19%|█▊ | 653/3507 [16:58<55:34, 1.17s/it]tensor([[-4.6875, -3.0156, 1.4219, -0.1562, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -2.5312, 1.9297, -1.2734, -3.6562]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5469, -2.3906, 1.0000, 1.6875, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:01:45,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.48 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.3281, -1.6719, 1.9141, -0.5273, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4844, -2.7969, -0.4707, 1.4609, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.8438, -3.7500, -0.3965, 0.5312, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8281, -2.7969, 0.3613, 1.5391, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.1562, -0.5000, 2.7500, -0.1377, -1.7578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') [2025-11-06 18:01:46,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:01:46,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.51 | bwd_microstep: 776.23 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 774.94 | step_microstep: 1.91 [2025-11-06 18:01:46,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.01 | bwd: 777.13 | bwd_inner: 2.00 | bwd_allreduce: 774.98 | step: 1.99 19%|█▊ | 654/3507 [17:00<1:06:22, 1.40s/it] {'loss': 1.4349, 'learning_rate': 1.8745923382759297e-05, 'epoch': 0.19} 19%|█▊ | 654/3507 [17:00<1:06:22, 1.40s/it]tensor([[-3.4375, -2.9219, -0.9922, 1.9531, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5938, -2.0156, 1.6484, 0.1846, -2.8438]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:01:46,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.50 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.1562, -2.5781, 1.5078, 0.1328, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9062, -3.5938, 0.4023, 0.5586, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7031, -1.8594, 2.4688, -0.7930, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.7031, -2.3750, 0.9180, -0.5625, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5312, -3.7344, -1.0234, 1.1172, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2031, -1.9141, 1.3047, -0.1475, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:01:46,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:01:46,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.89 | bwd_microstep: 110.26 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 109.25 | step_microstep: 1.46 [2025-11-06 18:01:46,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.41 | bwd: 111.12 | bwd_inner: 1.72 | bwd_allreduce: 109.28 | step: 1.55 19%|█▊ | 655/3507 [17:00<53:22, 1.12s/it] {'loss': 0.9513, 'learning_rate': 1.874144088275764e-05, 'epoch': 0.19} 19%|█▊ | 655/3507 [17:00<53:22, 1.12s/it]tensor([[-2.7969, -1.3516, 1.9219, -0.1904, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.4688, -0.7500, 2.8594, 0.6992, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') 
tensor([[-3.0469, -1.3203, 2.5000, -0.4277, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0781, -2.3438, -0.0386, 1.4062, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8125, -1.4609, 1.8984, 1.2812, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0938, -3.3438, -0.7930, 1.0156, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0312, -4.4688, -0.3125, -1.3906, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:01:49,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.43 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.6406, -2.8438, -0.2617, 1.3750, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:01:49,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:01:49,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.46 | bwd_microstep: 1.79 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.81 | step_microstep: 1.81 [2025-11-06 18:01:49,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.91 | bwd: 2.65 | bwd_inner: 1.67 | bwd_allreduce: 0.84 | step: 1.89 19%|█▊ | 656/3507 [17:03<1:18:16, 1.65s/it] {'loss': 0.6208, 'learning_rate': 1.8736950923947164e-05, 'epoch': 0.19} 19%|█▊ | 656/3507 [17:03<1:18:16, 1.65s/it]tensor([[-3.7969, -2.5156, 1.1016, 1.2031, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9688, -4.6250, -0.7969, -0.9375, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:01:49,873] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.90 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.3438, -3.3594, -0.2930, 1.1172, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0625, -5.0938, -1.8281, 0.0076, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, -3.7656, -1.4219, 1.5078, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5312, -3.7812, 1.0234, -0.6016, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6562, -2.9375, 1.5391, -0.1089, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4688, -1.6875, 2.4375, 0.4922, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:01:50,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.20 [2025-11-06 18:01:50,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.53 | bwd_microstep: 64.30 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 63.17 | step_microstep: 2.00 [2025-11-06 18:01:50,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 288.46 | bwd: 65.20 | bwd_inner: 1.84 | bwd_allreduce: 63.21 | step: 2.09 19%|█▊ | 657/3507 [17:03<1:00:18, 1.27s/it] {'loss': 0.2908, 'learning_rate': 1.8732453510159025e-05, 'epoch': 0.19} 19%|█▊ | 657/3507 [17:03<1:00:18, 1.27s/it]tensor([[-0.5898, 0.4277, 2.9062, 3.2188, -0.1660]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6875, -3.1719, -1.2266, 1.7031, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-3.9375, -3.3281, -1.1484, 1.4609, -2.8125]], device='cuda:3',
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2812, -4.4062, -1.4531, 0.7109, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7109, -1.1328, 0.7109, 3.2031, -0.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:01:52,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.59 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.8438, -3.4531, 0.6211, 0.4512, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0625, -2.9219, 0.4395, 0.9141, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.0781, -2.3594, -0.0908, 1.6328, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:01:52,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:01:52,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.68 | bwd_microstep: 39.05 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 37.86 | step_microstep: 1.72 [2025-11-06 18:01:52,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.29 | bwd: 40.05 | bwd_inner: 2.00 | bwd_allreduce: 37.90 | step: 1.81 19%|█▉ | 658/3507 [17:06<1:18:41, 1.66s/it] {'loss': 0.8509, 'learning_rate': 1.872794864523072e-05, 'epoch': 0.19} 19%|█▉ | 658/3507 [17:06<1:18:41, 1.66s/it]tensor([[-7.4375, -5.9375, -1.6016, -1.8594, -6.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5781, -1.9844, 1.7422, -0.2021, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8906, -1.9922, 2.4219, -1.1016, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') tensor([[-3.9531, -2.7344, 0.8906, 1.2891, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:01:52,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.00 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.7812, -3.3594, 0.7539, 0.9805, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4688e+00, -3.3750e+00, -1.0452e-03, 1.2656e+00, -3.3594e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7188, -1.9375, 0.4805, 2.5312, -1.8359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1250, -3.0156, 0.3008, 1.2969, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:01:53,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:01:53,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.00 | bwd_microstep: 2.44 | bwd_inner_microstep: 1.54 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.22 [2025-11-06 18:01:53,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 455.03 | bwd: 3.41 | bwd_inner: 2.44 | bwd_allreduce: 0.85 | step: 2.29 19%|█▉ | 659/3507 [17:06<1:02:10, 1.31s/it] {'loss': 0.5429, 'learning_rate': 1.8723436333006124e-05, 'epoch': 0.19} 19%|█▉ | 659/3507 [17:06<1:02:10, 1.31s/it]tensor([[-3.5625, -1.8359, 1.7812, -0.7578, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3125, -3.5781, 0.6445, -1.6094, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5000, -2.3750, 0.8125, 1.5078, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') 
tensor([[-3.8750, -2.3438, 1.1406, -0.4102, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6719, -2.7500, -0.1660, 0.9258, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2188, -3.5312, 1.0312, -0.5312, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.0469, -1.5234, 1.6953, -0.7852, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:01:55,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.48 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.7422, -0.0349, 3.2031, 0.1162, -1.4297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:01:56,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:01:56,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.85 | bwd_microstep: 2.01 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.21
[2025-11-06 18:01:56,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 422.36 | bwd: 2.97 | bwd_inner: 1.87 | bwd_allreduce: 0.89 | step: 2.30
19%|█▉ | 660/3507 [17:10<1:26:50, 1.83s/it] {'loss': 0.6998, 'learning_rate': 1.871891657733545e-05, 'epoch': 0.19}
tensor([[-3.8438, -1.8047, 2.5625, -1.4062, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6562, -4.0625, -1.6641, 1.6328, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.3750, -2.0156, 1.0859, 0.1504, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.4688, -3.8438, -1.6328, 0.9648, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:01:56,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 231.85 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.3750, -2.6875, -0.2637, 2.5625, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8906, -3.0781, -0.3223, 2.0938, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5312, -2.7188, 1.6719, -0.5391, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.0625, -3.0469, -0.0233, 1.0156, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:01:56,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 18:01:56,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.84 | bwd_microstep: 76.61 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 75.49 | step_microstep: 1.86
[2025-11-06 18:01:56,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.71 | bwd: 77.45 | bwd_inner: 1.81 | bwd_allreduce: 75.52 | step: 1.93
19%|█▉ | 661/3507 [17:10<1:07:49, 1.43s/it] {'loss': 0.7268, 'learning_rate': 1.8714389382075273e-05, 'epoch': 0.19}
tensor([[-3.6406, -2.5156, 0.7070, 1.7344, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7031, -2.8281, -0.0566, 1.7969, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7500, -2.3906, 0.9727, 0.1367, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7656, -2.0781, 2.0781, 0.4434, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2500, -2.0781, 1.0469, 1.1875, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0938, -3.2656, -0.6289, 1.0938, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2656, -2.1250, 1.0859, 2.0625, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:01:58,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.18 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06
tensor([[-2.8438, -1.4062, 1.9062, 0.8477, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:01:58,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:01:58,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.40 | bwd_microstep: 1.84 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.28
[2025-11-06 18:01:58,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.56 | bwd: 2.69 | bwd_inner: 1.70 | bwd_allreduce: 0.87 | step: 2.35
19%|█▉ | 662/3507 [17:12<1:19:31, 1.68s/it] {'loss': 0.3474, 'learning_rate': 1.870985475108851e-05, 'epoch': 0.19}
tensor([[-3.6875, -2.8594, -0.1875, 2.0156, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8438, -2.4062, -0.6797, 2.3906, -1.8516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:01:59,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.02 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-7.0312, -6.0312, -2.4219, -0.3262, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.8008, 1.0156, 4.2812, 0.4082, -0.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5625, -3.6094, -0.6641, 0.8984, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5625, -2.8281, 1.3438, -0.7617, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.7031, 0.8359, 3.7656, 0.9609, -0.5195]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-3.8281, -2.1562, 1.7500, -0.9180, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:01:59,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:01:59,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 252.95 | bwd_microstep: 45.56 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 44.42 | step_microstep: 1.91
[2025-11-06 18:01:59,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 420.00 | bwd: 46.31 | bwd_inner: 1.75 | bwd_allreduce: 44.45 | step: 1.98
19%|█▉ | 663/3507 [17:13<1:02:48, 1.33s/it] {'loss': 0.4873, 'learning_rate': 1.8705312688244432e-05, 'epoch': 0.19}
tensor([[-2.8438, -2.2188, -0.2168, 1.8984, -1.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2812, -3.3750, -0.3945, 1.6172, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.0625, -1.5000, 2.2500, 0.6016, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7031, -2.7188, 0.2031, 1.3906, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.2656, -1.2422, 1.5391, 2.8594, -1.4766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1562, -2.7812, 1.0547, 1.0469, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.1719, -0.5742, 3.3438, 1.8516, -1.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:02:01,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.92 | bwd_microstep: 1.16 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.5625, -4.1562, -0.3086, -0.5508, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:02:01,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.14 | optimizer_step: 0.18
[2025-11-06 18:02:01,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.81 | bwd_microstep: 1.98 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.09
[2025-11-06 18:02:01,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.75 | bwd: 3.15 | bwd_inner: 2.09 | bwd_allreduce: 0.93 | step: 2.18
19%|█▉ | 664/3507 [17:15<1:15:14, 1.59s/it] {'loss': 0.3262, 'learning_rate': 1.8700763197418638e-05, 'epoch': 0.19}
tensor([[-3.6875, -2.2188, 1.7031, 1.2656, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.0938, -0.6016, 2.3281, 0.0466, -1.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-9.3750, -7.7500, -2.7969, -2.7656, -7.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.3125, -2.8125, 0.9961, -0.2617, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:02:01,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.36 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-4.7188, -3.4375, 0.2480, 0.6758, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5625, -4.3438, -0.6328, 0.5820, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.9219, -2.8750, 0.1807, 1.2969, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.2031, -0.5938, 3.1406, 1.1250, -1.6953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:02:02,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:02:02,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.46 | bwd_microstep: 122.02 | bwd_inner_microstep: 1.87 | bwd_allreduce_microstep: 120.03 | step_microstep: 2.08
[2025-11-06 18:02:02,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.84 | bwd: 123.01 | bwd_inner: 2.75 | bwd_allreduce: 120.06 | step: 2.17
19%|█▉ | 665/3507 [17:15<59:32, 1.26s/it] {'loss': 0.5626, 'learning_rate': 1.8696206282493076e-05, 'epoch': 0.19}
tensor([[-5.4375, -3.8438, 0.5039, -0.1816, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8125, -2.5781, 0.9141, 1.5859, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3750, -3.5469, -0.8008, 1.0859, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6875, -3.6562, -0.5469, 0.6523, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0625, -2.9375, 0.1953, 0.9492, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.9062, -1.7578, 1.3125, 2.1250, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.9062, -1.5469, 1.6562, 0.6992, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:02:03,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.79 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.6719, -2.4688, 0.4414, 0.4277, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:02:04,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.12 | optimizer_step: 0.15
[2025-11-06 18:02:04,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.44 | bwd_microstep: 2.02 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.83 | step_microstep: 1.97
[2025-11-06 18:02:04,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 270.24 | bwd: 3.05 | bwd_inner: 2.07 | bwd_allreduce: 0.86 | step: 2.05
19%|█▉ | 666/3507 [17:17<1:08:33, 1.45s/it] {'loss': 0.5864, 'learning_rate': 1.8691641947356022e-05, 'epoch': 0.19}
tensor([[-5.7500, -5.0312, -2.2188, 0.6523, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6250, -2.5469, 0.4199, 1.1406, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.9375, -2.3906, -0.3789, 2.3750, -1.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:02:04,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.79 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.7344, -3.2031, -1.1875, 2.0938, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.4062, -0.6875, 2.3125, -0.5547, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-3.4219, -2.5000, 0.1279, 0.9805, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.4375, -2.9531, 1.0781, 0.8867, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.1719, -1.1484, 3.2031, -0.4648, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:02:04,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.14 | optimizer_step: 0.18
[2025-11-06 18:02:04,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.71 | bwd_microstep: 99.03 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 97.80 | step_microstep: 1.43
[2025-11-06 18:02:04,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.52 | bwd: 99.92 | bwd_inner: 1.93 | bwd_allreduce: 97.85 | step: 1.52
19%|█▉ | 667/3507 [17:18<55:40, 1.18s/it] {'loss': 0.6277, 'learning_rate': 1.868707019590209e-05, 'epoch': 0.19}
tensor([[-4.1875, -2.2812, 1.6328, -1.4844, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.0938, -1.7812, 1.1328, 0.1387, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.7500, -5.8125, -2.4531, -0.2676, -5.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.5312, -1.4375, 1.3047, 2.2500, -1.7422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.5156, -1.9531, -0.1709, 2.1406, -1.6484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7656, -2.9219, -0.2988, 1.3750, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9688, -4.1875, -1.3984, 1.2812, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:02:06,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.24 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-3.5938, -2.3125, 0.9844, 0.9570, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:02:06,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 18:02:06,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.48 | bwd_microstep: 1.88 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.23
[2025-11-06 18:02:06,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.74 | bwd: 2.78 | bwd_inner: 1.83 | bwd_allreduce: 0.83 | step: 2.31
19%|█▉ | 668/3507 [17:20<1:13:12, 1.55s/it] {'loss': 1.202, 'learning_rate': 1.868249103203221e-05, 'epoch': 0.19}
tensor([[-3.2969, -1.7812, 1.6484, -0.1387, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.6562, -5.1250, -0.6406, -0.1768, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.3906, -1.5469, 2.5781, -0.2217, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:02:07,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.04 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.0938, -2.4688, 1.6875, 0.0117, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.0000, -2.3594, -0.4199, 1.8750, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2188, -2.7344, 1.2969, 0.8398, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2500, -2.6250, 1.5156, 0.1973, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.0625, -0.5430, 2.4062, 0.3828, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:02:07,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 18:02:07,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.36 | bwd_microstep: 100.39 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 99.33 | step_microstep: 2.13
[2025-11-06 18:02:07,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.42 | bwd: 101.27 | bwd_inner: 1.78 | bwd_allreduce: 99.37 | step: 2.22
19%|█▉ | 669/3507 [17:21<57:50, 1.22s/it] {'loss': 0.8488, 'learning_rate': 1.867790445965365e-05, 'epoch': 0.19}
tensor([[-2.2969, -0.6094, 2.5625, -0.3887, -1.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.5312, -2.6250, 0.1016, 1.8359, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7188, -1.5156, 1.2266, 1.0391, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7188, -2.2500, 1.4453, 0.7148, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.0625, -1.1250, 2.8594, -0.7930, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1250, -2.1406, 2.2969, -1.1094, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.3125, -2.4688, 0.0588, 1.9609, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:02:09,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.49 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11
tensor([[-2.9062, -2.0938, 0.4414, 2.5938, -1.9766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:02:09,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 18:02:09,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.56 | bwd_microstep: 1.77 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.15
[2025-11-06 18:02:09,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 298.08 | bwd: 2.78 | bwd_inner: 1.76 | bwd_allreduce: 0.87 | step: 2.27
19%|█▉ | 670/3507 [17:23<1:06:48, 1.41s/it] {'loss': 0.4392, 'learning_rate': 1.8673310482679997e-05, 'epoch': 0.19}
tensor([[-4.2812, -2.5625, 1.3984, -0.9375, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7344, -2.8594, -0.1797, 2.1406, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:02:09,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.64 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.4219, -2.8906, -0.9531, 1.8125, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.4375, -1.7891, 2.2031, 0.4043, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7656, -1.8984, 0.7109, 2.5469, -1.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.0469, -0.6875, 2.2344, 0.8359, -1.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.9531, -1.7812, 1.0156, 1.2344, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.7500, -5.3750, -1.1953, 0.0444, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:02:09,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.20 | optimizer_step: 0.19
[2025-11-06 18:02:09,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.66 | bwd_microstep: 52.70 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 51.54 | step_microstep: 2.03
[2025-11-06 18:02:09,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.32 | bwd: 53.67 | bwd_inner: 1.85 | bwd_allreduce: 51.62 | step: 2.10
19%|█▉ | 671/3507 [17:23<52:57, 1.12s/it] {'loss': 0.2325, 'learning_rate': 1.866870910503115e-05, 'epoch': 0.19}
tensor([[-3.8750, -2.5938, 0.8711, 1.4219, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7188, -3.6875, -0.4258, 1.3047, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4375, -2.5000, 2.0781, -0.3867, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3438, -4.0312, -0.0825, 0.8125, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0938, -4.6875, -0.5820, 0.2168, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.0625, -0.4551, 2.4375, -0.1689, -1.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.4531, -2.9375, -1.1484, 1.0859, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:02:11,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.22 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.8125, -2.9531, 1.6250, -0.5938, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:02:12,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:02:12,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.67 | bwd_microstep: 1.90 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.81 | step_microstep: 1.74
[2025-11-06 18:02:12,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.87 | bwd: 2.83 | bwd_inner: 1.87 | bwd_allreduce: 0.85 | step: 1.83
19%|█▉ | 672/3507 [17:26<1:11:44, 1.52s/it] {'loss': 1.0388, 'learning_rate': 1.8664100330633327e-05, 'epoch': 0.19}
tensor([[-5.2500, -3.4219, 1.1250, -0.5625, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0312, -1.9141, 1.0781, 1.6797, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.9062, -2.0938, 2.2656, 0.0070, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:02:12,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.20 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.7188, -3.0000, -0.6484, 1.6328, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.3125, -2.7500, 1.1719, -0.2637, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8125, -3.6250, -0.0452, 1.5156, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6406, -2.4688, 0.5391, 0.7578, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9844, -3.0312, -0.2949, 0.8672, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:02:15,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.02 | optimizer_gradients: 0.18 | optimizer_step: 0.16
[2025-11-06 18:02:15,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.72 | bwd_microstep: 2901.14 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 2900.07 | step_microstep: 3.97
[2025-11-06 18:02:15,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 412.94 | bwd: 2902.19 | bwd_inner: 1.95 | bwd_allreduce: 2900.11 | step: 4.05
19%|█▉ | 673/3507 [17:29<1:37:42, 2.07s/it] {'loss': 0.6413, 'learning_rate': 1.865948416341906e-05, 'epoch': 0.19}
tensor([[-2.9531, -2.5156, -0.9609, 1.8672, -1.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-2.5938, -1.4531, 1.1719, 0.9922, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.9219, -2.9531, -0.2520, 1.2812, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:02:15,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 148.91 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.3125, -4.3125, -1.1719, 0.5273, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8594, -3.0156, -0.3887, 1.4766, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0781, -2.4062, -0.2295, 2.0781, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3750, -4.5000, -1.6094, 1.3594, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.9688, -2.5781, 1.1328, 1.3516, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:02:15,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.11 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:02:15,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.23 | bwd_microstep: 18.55 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 17.39 | step_microstep: 2.79
[2025-11-06 18:02:15,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.17 | bwd: 19.31 | bwd_inner: 1.76 | bwd_allreduce: 17.43 | step: 2.86
19%|█▉ | 674/3507 [17:29<1:14:22, 1.58s/it] {'loss': 0.9382, 'learning_rate': 1.8654860607327177e-05, 'epoch': 0.19}
tensor([[-2.5781, -2.0156, -0.2656, 2.2188, -1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:02:16,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.42 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.8438, -3.3438, 0.7578, 0.8125, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.5000, -0.8906, 2.2969, -0.3535, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-1.3984, -0.4375, 1.9922, 3.4062, -0.7422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5312, -3.5312, -0.4941, 0.9453, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.1250, -0.9180, 1.8750, 1.5156, -1.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.3906, -0.9609, 1.8750, 0.4141, -1.8516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4062, -2.3281, 0.6211, 1.7266, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:02:17,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.42 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:02:17,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.39 | bwd_microstep: 923.15 | bwd_inner_microstep: 1.28 | bwd_allreduce_microstep: 921.77 | step_microstep: 3.65
[2025-11-06 18:02:17,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 292.83 | bwd: 924.03 | bwd_inner: 2.07 | bwd_allreduce: 921.82 | step: 3.74
19%|█▉ | 675/3507 [17:31<1:09:45, 1.48s/it] {'loss': 1.2094, 'learning_rate': 1.8650229666302827e-05, 'epoch': 0.19}
tensor([[-2.8438, -1.7109, 1.2734, 2.1094, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.5625, -0.7500, 2.4375, -0.7188, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-6.1875, -5.1250, -1.6953, 0.0811, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6094, -2.2188, 1.0781, -0.0723, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4062, -2.9219, 1.0391, 1.0391, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7969, -2.8438, -0.2432, 0.9609, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5312, -1.5781, 2.3750, -1.1953, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:02:18,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.10 | bwd_microstep: 1.92 | bwd_inner_microstep: 1.62 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.20
tensor([[-3.7188, -2.1562, 1.5078, 1.0859, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:02:18,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.22 | optimizer_step: 0.29
[2025-11-06 18:02:18,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.20 | bwd_microstep: 3.16 | bwd_inner_microstep: 1.78 | bwd_allreduce_microstep: 1.26 | step_microstep: 2.31
[2025-11-06 18:02:18,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.37 | bwd: 5.07 | bwd_inner: 3.43 | bwd_allreduce: 1.36 | step: 2.51
19%|█▉ | 676/3507 [17:32<1:08:37, 1.45s/it] {'loss': 1.2137, 'learning_rate': 1.864559134429745e-05, 'epoch': 0.19}
tensor([[-4.3750, -3.8281, -1.6641, 1.5859, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:02:18,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.53 | bwd_microstep: 1.34 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.14
tensor([[-4.2188, -3.4219, -0.8555, 1.4141, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6562, -2.2656, 1.4219, 1.7031, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1875, -2.8125, 0.6094, 0.2891, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6250, -2.3906, 0.9844, 1.6016, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.0469, -2.2812, -0.0708, 1.5703, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4219, -2.4844, 0.2559, 2.2344, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0625, -3.1875, -0.4941, 1.2578, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:02:20,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.32 | optimizer_step: 0.42
[2025-11-06 18:02:20,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 233.24 | bwd_microstep: 1194.22 | bwd_inner_microstep: 2.56 | bwd_allreduce_microstep: 1191.46 | step_microstep: 3.32
[2025-11-06 18:02:20,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.80 | bwd: 1195.56 | bwd_inner: 3.78 | bwd_allreduce: 1191.53 | step: 3.46
19%|█▉ | 677/3507 [17:34<1:10:57, 1.50s/it] {'loss': 0.9352, 'learning_rate': 1.864094564526879e-05, 'epoch': 0.19}
tensor([[-4.1250, -3.0625, 0.0277, 0.8750, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.3438, -0.6016, 2.2812, -0.8281, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:02:20,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 212.44 | bwd_microstep: 1.95 | bwd_inner_microstep: 1.67 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.22
tensor([[-3.5781, -1.6250, 2.3438, -0.7188, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.0312, -4.8125, -0.9805, 0.6875, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5625, -3.5312, -0.6094, 0.7383, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.1094, -1.0469, 1.6641, 2.2031, -1.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7031, -2.2500, 1.3984, 1.4453, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7500, -3.4844, 0.1016, 0.8203, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:02:20,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:02:20,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.35 | bwd_microstep: 321.54 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 320.62 | step_microstep: 2.78
[2025-11-06 18:02:20,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.81 | bwd: 323.49 | bwd_inner: 2.55 | bwd_allreduce: 320.72 | step: 3.00
19%|█▉ | 678/3507 [17:34<59:32, 1.26s/it] {'loss': 0.689, 'learning_rate': 1.863629257318088e-05, 'epoch': 0.19}
tensor([[-2.2188, -0.6641, 1.9844, -0.8750, -1.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-4.7188, -3.6094, -0.4180, 0.7383, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:02:21,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.63 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.0625, -3.4062, -1.0703, 1.7578, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.2188, -0.5820, 2.2656, 0.3848, -1.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.8281, -2.1094, 1.6719, -0.4023, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.7812, -0.9727, 2.7500, -0.2871, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.2500, -2.5625, 1.4297, -0.1318, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7500, -1.7422, 2.2188, -1.4297, -3.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:02:21,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.78 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:02:21,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.64 | bwd_microstep: 282.59 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 281.57 | step_microstep: 2.45
[2025-11-06 18:02:21,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 435.30 | bwd: 283.51 | bwd_inner: 1.74 | bwd_allreduce: 281.62 | step: 2.54
19%|█▉ | 679/3507 [17:35<52:27, 1.11s/it] {'loss': 0.8406, 'learning_rate': 1.8631632132004048e-05, 'epoch': 0.19}
tensor([[-3.7344, -3.1719, -1.1875, 1.8750, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.0938, -1.2422, 0.9922, 2.6406, -1.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:02:21,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.27 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.7344, -2.7812, -0.0118, 2.0000, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.1875, -1.2031, 2.3438, -1.4844, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4062, -1.9219, 1.7891, 1.4141, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0156, -1.3203, 2.1562, 0.0354, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.7656, -1.8828, 2.2656, -0.1729, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.0000, -2.0000, 0.2832, 0.3008, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:02:22,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:02:22,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.65 | bwd_microstep: 890.62 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 889.47 | step_microstep: 1.86
[2025-11-06 18:02:22,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.94 | bwd: 891.45 | bwd_inner: 1.82 | bwd_allreduce: 889.51 | step: 1.93
19%|█▉ | 680/3507 [17:36<54:01, 1.15s/it] {'loss': 0.3087, 'learning_rate': 1.8626964325714903e-05, 'epoch': 0.19}
tensor([[-3.1719, -1.3828, 2.3594, 0.0776, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:02:23,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.88 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.7500, -4.8438, -1.8203, 0.4277, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.2344, -0.1689, 2.0938, 2.0000, -0.7617]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.5469, -0.7031, 2.6719, -0.3105, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-3.7031, -3.0781, -1.1875, 0.8086, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-3.2188, -2.0938, 0.7734, 1.6875, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.3281, -2.8281, -0.9766, 2.0938, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6250, -4.6250, -1.3438, 0.7812, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3],
device='cuda:2') [2025-11-06 18:02:25,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:02:25,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.42 | bwd_microstep: 1874.75 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1873.64 | step_microstep: 2.06 [2025-11-06 18:02:25,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 310.27 | bwd: 1875.65 | bwd_inner: 1.84 | bwd_allreduce: 1873.68 | step: 2.15 19%|█▉ | 681/3507 [17:38<1:09:07, 1.47s/it] {'loss': 1.0965, 'learning_rate': 1.862228915829635e-05, 'epoch': 0.19} 19%|█▉ | 681/3507 [17:38<1:09:07, 1.47s/it]tensor([[-4.5312, -3.2500, 0.3418, 1.4219, -3.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:02:25,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.59 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.9844, -2.4375, -0.5859, 2.4062, -1.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3750, -1.7891, 1.6016, 0.0471, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2500, -4.2188, -1.1797, 0.2949, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9531, -2.9844, -0.1611, 1.4531, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1719, -1.6562, 1.6797, 0.4180, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5781, -2.7500, -0.1855, 2.1094, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2188, -1.4375, 2.4219, 0.1660, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 
18:02:25,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:02:25,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.34 | bwd_microstep: 83.79 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 82.56 | step_microstep: 1.55 [2025-11-06 18:02:25,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.95 | bwd: 84.87 | bwd_inner: 2.14 | bwd_allreduce: 82.60 | step: 1.64 19%|█▉ | 682/3507 [17:39<55:16, 1.17s/it] {'loss': 0.6866, 'learning_rate': 1.8617606633737565e-05, 'epoch': 0.19} 19%|█▉ | 682/3507 [17:39<55:16, 1.17s/it]tensor([[-2.2188, -1.2812, 1.1953, 3.1094, -1.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0781, -2.3438, -0.0417, 2.5312, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:02:25,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.99 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.3750e+00, -3.2344e+00, 2.5635e-03, 1.5859e+00, -3.2188e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0000, -2.5781, 0.6836, 0.1436, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-4.8125, -4.2500, -2.1719, 0.4961, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')tensor([3], device='cuda:0') tensor([[-3.0938, -1.6719, 1.1875, 0.3301, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2344, -2.0312, 0.9180, 1.5859, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8125, -3.2812, -1.2344, 2.1562, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:02:27,730] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:02:27,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.28 | bwd_microstep: 1707.87 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 1706.84 | step_microstep: 2.10 [2025-11-06 18:02:27,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.29 | bwd: 1708.77 | bwd_inner: 1.77 | bwd_allreduce: 1706.88 | step: 2.17 19%|█▉ | 683/3507 [17:41<1:08:10, 1.45s/it] {'loss': 0.2468, 'learning_rate': 1.861291675603401e-05, 'epoch': 0.19} 19%|█▉ | 683/3507 [17:41<1:08:10, 1.45s/it]tensor([[-4.0000, -2.7500, 0.5664, 1.4609, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:02:27,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.51 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.9688, -2.7500, 0.1416, 0.4668, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7969, -3.1562, -0.9805, 1.8828, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5781, -2.3125, 0.8906, 1.5156, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1875, -3.6406, 0.3359, 0.0728, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.1875, -0.5234, 2.2812, -0.4531, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.2969, -0.5078, 2.3906, -0.7031, -1.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5625, -2.8281, -0.5938, 1.6719, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:02:28,067] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:02:28,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.58 | bwd_microstep: 29.59 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 28.64 | step_microstep: 1.69 [2025-11-06 18:02:28,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 277.13 | bwd: 30.46 | bwd_inner: 1.65 | bwd_allreduce: 28.68 | step: 1.77 20%|█▉ | 684/3507 [17:41<52:28, 1.12s/it] {'loss': 0.8484, 'learning_rate': 1.860821952918741e-05, 'epoch': 0.2} 20%|█▉ | 684/3507 [17:41<52:28, 1.12s/it]tensor([[-3.5312, -2.4688, 0.3672, 1.7578, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2812, -3.6406, -1.4219, 1.4375, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:02:28,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.44 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0312, -2.4844, 1.3438, 1.1016, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.4844, -1.9609, -0.4414, 1.8281, -1.5703]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.0469, 0.2617, 2.4219, 0.7500, -0.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.4844, -2.6562, -0.1895, 2.0625, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.0469, -2.2500, 0.1108, 2.5312, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0312, -4.4688, -0.3105, -0.0723, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:02:28,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | 
optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:02:28,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.71 | bwd_microstep: 241.60 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 240.52 | step_microstep: 1.60 [2025-11-06 18:02:28,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 406.17 | bwd: 242.39 | bwd_inner: 1.71 | bwd_allreduce: 240.55 | step: 1.69 20%|█▉ | 685/3507 [17:42<46:24, 1.01it/s] {'loss': 0.5486, 'learning_rate': 1.860351495720577e-05, 'epoch': 0.2} 20%|█▉ | 685/3507 [17:42<46:24, 1.01it/s]tensor([[-2.6562, -2.1250, -0.2969, 2.7812, -1.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5625, -2.7656, 1.2656, -0.5664, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:02:29,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.71 | bwd_microstep: 1.52 | bwd_inner_microstep: 1.42 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.1562, -2.1875, 1.8359, -0.7383, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0156, -1.3594, 1.6875, 0.0388, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9531, -2.2031, 1.8359, 0.7148, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.3594, -2.0000, 1.0703, 1.0312, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7656, -2.5312, 0.6016, 1.4062, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4375, -3.7344, -1.3672, 1.4922, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:02:31,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.19 | optimizer_step: 0.30 
[2025-11-06 18:02:31,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.29 | bwd_microstep: 1749.77 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1748.56 | step_microstep: 2.10 [2025-11-06 18:02:31,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 491.03 | bwd: 1751.30 | bwd_inner: 2.56 | bwd_allreduce: 1748.61 | step: 2.18 20%|█▉ | 686/3507 [17:44<1:04:44, 1.38s/it] {'loss': 0.2621, 'learning_rate': 1.859880304410337e-05, 'epoch': 0.2} 20%|█▉ | 686/3507 [17:44<1:04:44, 1.38s/it]tensor([[-3.4688, -1.5781, 2.4375, -0.0801, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:02:31,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.59 | bwd_microstep: 1.10 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.6719, -1.8594, 1.7266, -0.8086, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8750, -3.6406, -0.2871, 0.6680, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4062, -3.6406, -1.1797, 1.4766, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6875, -2.7656, -0.1416, 2.0938, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.3750, -0.5898, 2.6094, -0.4570, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -2.6875, 0.6602, 0.3516, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[4.0625, 4.5000, 4.8750, 6.5625, 3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:02:31,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 18:02:31,453] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.45 | bwd_microstep: 128.19 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 126.97 | step_microstep: 1.71 [2025-11-06 18:02:31,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 253.06 | bwd: 129.29 | bwd_inner: 2.17 | bwd_allreduce: 127.00 | step: 1.77 20%|█▉ | 687/3507 [17:45<51:06, 1.09s/it] {'loss': 0.2679, 'learning_rate': 1.859408379390073e-05, 'epoch': 0.2} 20%|█▉ | 687/3507 [17:45<51:06, 1.09s/it]tensor([[-3.7656, -2.7500, 0.0422, 1.8828, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1562, -2.8594, 0.4414, 0.9648, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:02:31,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.28 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 tensor([[-3.1406, -1.8359, 1.3125, 2.0312, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9375, -3.2500, -0.9727, 1.9219, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2031, -1.4844, 1.9609, -0.0811, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.1094, -0.3535, 2.5312, -0.3945, -1.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0156, -1.0000, 2.5312, -0.4844, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.4844, -2.9375, -1.0000, 1.8750, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:02:32,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 18:02:32,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 152.83 | bwd_microstep: 1188.02 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 1187.01 | step_microstep: 1.84 [2025-11-06 18:02:32,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.14 | bwd: 1188.88 | bwd_inner: 1.66 | bwd_allreduce: 1187.07 | step: 1.94 20%|█▉ | 688/3507 [17:46<57:28, 1.22s/it] {'loss': 0.7312, 'learning_rate': 1.8589357210624647e-05, 'epoch': 0.2} 20%|█▉ | 688/3507 [17:46<57:28, 1.22s/it]tensor([[-2.9844, -1.8672, 0.7656, 1.3281, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3750, -1.3125, 2.7031, -0.5703, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2969, -0.4102, 2.6875, -1.1094, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:02:33,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.52 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.6875, -2.7500, -0.1748, 1.4297, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5469, -2.3594, 0.7773, 2.3281, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.5898, 0.0352, 1.5156, 2.9062, -0.1089]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6875, -2.3438, 0.8906, 1.5781, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8906, -2.9844, -0.4062, 1.7031, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:02:33,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.14 | optimizer_step: 0.18 [2025-11-06 18:02:33,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.51 | bwd_microstep: 8.87 | 
bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 7.83 | step_microstep: 1.83 [2025-11-06 18:02:33,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.06 | bwd: 9.62 | bwd_inner: 1.64 | bwd_allreduce: 7.85 | step: 1.90 20%|█▉ | 689/3507 [17:47<46:22, 1.01it/s] {'loss': 0.4461, 'learning_rate': 1.8584623298308176e-05, 'epoch': 0.2} 20%|█▉ | 689/3507 [17:47<46:22, 1.01it/s]tensor([[-4.0938, -3.2969, -0.8242, 1.5859, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:02:33,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.39 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-0.0806, 1.7812, 4.4688, 0.9805, -0.0796]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3281, -2.0469, 1.0156, 1.5859, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1875, -2.1406, 2.0156, -0.6484, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6406, -1.8281, 1.9141, 0.1621, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7031, -2.4375, 0.6172, 1.3359, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4688, -2.8750, -0.8633, 2.1250, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5312, -3.5469, -0.7266, 1.0469, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:02:35,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:02:35,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.48 | bwd_microstep: 2131.83 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 
2130.85 | step_microstep: 2.01 [2025-11-06 18:02:35,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.89 | bwd: 2132.70 | bwd_inner: 1.69 | bwd_allreduce: 2130.89 | step: 2.10 20%|█▉ | 690/3507 [17:49<1:07:11, 1.43s/it] {'loss': 0.4261, 'learning_rate': 1.8579882060990627e-05, 'epoch': 0.2} 20%|█▉ | 690/3507 [17:49<1:07:11, 1.43s/it][h264 @ 0xd18f5c0] mmco: unref short failure [h264 @ 0xd18f5c0] mmco: unref short failure tensor([[-3.3438, -2.7031, -0.6602, 2.2969, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3125, -3.1094, -0.0527, 1.0703, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4688, -3.5156, 0.7617, -1.3594, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5625, -2.7812, -0.3809, 2.1250, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:02:36,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.80 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.8906, -2.0938, 0.1104, 1.9375, -1.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0312, -3.1250, -0.5156, 1.3750, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5000, -2.4219, 0.3867, 1.7344, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2656, -2.0938, 0.7734, 2.1719, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:02:36,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.14 | optimizer_step: 0.18 [2025-11-06 18:02:36,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.56 | bwd_microstep: 2.02 | 
bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.56 [2025-11-06 18:02:36,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.39 | bwd: 3.02 | bwd_inner: 2.03 | bwd_allreduce: 0.87 | step: 1.65 20%|█▉ | 691/3507 [17:50<54:26, 1.16s/it] {'loss': 0.323, 'learning_rate': 1.8575133502717545e-05, 'epoch': 0.2} 20%|█▉ | 691/3507 [17:50<54:26, 1.16s/it]tensor([[-4.4688, -3.0938, 0.5156, 1.4531, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4375, -4.9375, -0.8359, -0.1992, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:02:36,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.66 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.8750, -2.1562, 1.3281, -0.1001, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.4062, -1.2656, 1.4766, 2.7031, -1.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.4062, -1.9062, -0.1787, 2.9375, -1.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.3438, -1.5156, 1.9531, -0.3945, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9062, -3.0000, 1.2500, -0.6484, -3.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5000, -1.6016, 1.8984, -0.9336, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:02:37,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:02:37,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 229.02 | bwd_microstep: 1112.54 | bwd_inner_microstep: 0.97 | 
bwd_allreduce_microstep: 1111.48 | step_microstep: 1.81 [2025-11-06 18:02:37,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.72 | bwd: 1113.39 | bwd_inner: 1.74 | bwd_allreduce: 1111.51 | step: 1.88 20%|█▉ | 692/3507 [17:51<59:32, 1.27s/it] {'loss': 0.5718, 'learning_rate': 1.8570377627540735e-05, 'epoch': 0.2} 20%|█▉ | 692/3507 [17:51<59:32, 1.27s/it]tensor([[-0.8867, -0.4180, 0.9180, 3.8438, -0.1855]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7188, -2.0312, 0.0603, 2.5781, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6250, -3.0469, 0.5078, 0.1074, -3.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8906, 0.1514, 3.6719, 0.4434, -1.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2812, -3.8750, -2.1094, 1.4453, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:02:38,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.55 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.0781, -1.5625, 1.3438, 0.2393, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -3.7656, -1.6953, 1.6719, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4688, -4.8125, -2.3594, 0.8320, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:02:38,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:02:38,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.90 | bwd_microstep: 1.91 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 
1.70 [2025-11-06 18:02:38,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 274.47 | bwd: 2.73 | bwd_inner: 1.83 | bwd_allreduce: 0.78 | step: 1.75 20%|█▉ | 693/3507 [17:52<49:30, 1.06s/it] {'loss': 0.2139, 'learning_rate': 1.8565614439518246e-05, 'epoch': 0.2} 20%|█▉ | 693/3507 [17:52<49:30, 1.06s/it]tensor([[-3.9375, -2.1406, 1.6016, -0.0859, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.8984, -0.3652, 2.4062, 0.5430, -1.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6562, -2.2656, 0.8711, 1.1484, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4688, -3.4844, -0.6797, 1.3594, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.4453, -0.3652, 1.7109, 1.8672, -0.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8125, -3.2188, -1.2031, 1.4922, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:02:39,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.06 | bwd_microstep: 1.16 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 tensor([[-2.8594, -0.8867, 2.5156, -0.8125, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.9531, -1.0312, 2.5469, 0.0030, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:02:39,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 18:02:39,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 320.05 | bwd_microstep: 120.30 | bwd_inner_microstep: 1.43 | bwd_allreduce_microstep: 118.77 | step_microstep: 2.06 [2025-11-06 18:02:39,935] [INFO] 
[Console capture, training steps 694–715 of 3507 (epoch 0.2), cleaned for readability. The raw capture interleaved three streams per step: (1) tqdm progress lines, each recorded twice because the terminal redraws them via carriage returns; (2) per-rank debug prints of single-row bf16 logit tensors and integer label tensors on cuda:0–cuda:3, whose grad_fn=<...> names were stripped by the capture; (3) DeepSpeed [Rank 0] timing summaries from logging.py:128:log_dist. The duplicate progress lines and per-rank tensor prints are omitted below; the per-step metrics and aggregate timings are tabulated. All times in ms.]

step | loss   | learning_rate          | fwd    | bwd     | bwd_inner | bwd_allreduce | optim step | rate
-----|--------|------------------------|--------|---------|-----------|---------------|------------|---------
694  | 0.2709 | 1.8560843942714363e-05 | 445.13 | 121.45  | 2.44      | 118.82        | 2.16       | 1.17s/it
695  | 0.3137 | 1.85560661411996e-05   | 480.09 | 3.08    | 2.00      | 0.94          | 2.12       | 1.44s/it
696  | 0.3999 | 1.855128103905072e-05  | 290.39 | 124.87  | 1.96      | 122.78        | 2.64       | 1.14s/it
697  | 1.0627 | 1.8546488640350704e-05 | 266.58 | 2.92    | 2.01      | 0.78          | 2.91       | 1.40s/it
698  | 0.3723 | 1.8541688949188762e-05 | 260.12 | 104.11  | 1.83      | 102.14        | 2.41       | 1.10s/it
699  | 0.2489 | 1.8536881969660326e-05 | 286.90 | 3.18    | 2.06      | 0.97          | 2.85       | 1.13s/it
700  | 0.6239 | 1.853206770586705e-05  | 343.09 | 54.93   | 1.92      | 52.88         | 1.86       | 1.09it/s
701  | 0.5787 | 1.8527246161916796e-05 | 377.07 | 5.20    | 3.43      | 1.56          | 2.43       | 1.39s/it
702  | 0.1808 | 1.852241734192364e-05  | 333.45 | 5.48    | 3.67      | 1.56          | 3.01       | 1.09s/it
703  | 0.7042 | 1.8517581250007878e-05 | 297.03 | 2.89    | 1.78      | 0.95          | 4.86       | 1.34s/it
704  | 0.9556 | 1.8512737890295996e-05 | 398.69 | 2.99    | 1.94      | 0.92          | 2.10       | 1.39s/it
705  | 0.8184 | 1.850788726692069e-05  | 413.11 | 2.71    | 1.80      | 0.80          | 1.98       | 1.21s/it
706  | 0.4521 | 1.8503029384020847e-05 | 415.56 | 1416.11 | 1.73      | 1414.24       | 3.75       | 1.67s/it
707  | 0.7892 | 1.8498164245741558e-05 | 236.43 | 176.38  | 2.09      | 174.14        | 1.69       | 1.30s/it
708  | 0.9554 | 1.8493291856234093e-05 | 268.68 | 121.66  | 1.70      | 119.81        | 3.33       | 1.19s/it
709  | 1.0041 | 1.848841221965592e-05  | 354.33 | 662.28  | 1.85      | 660.31        | 2.36       | 1.15s/it
710  | 0.799  | 1.8483525340170687e-05 | 393.60 | 924.76  | 4.57      | 919.85        | 2.84       | 1.67s/it
711  | 0.6783 | 1.8478631221948217e-05 | 494.71 | 2.99    | 1.95      | 0.90          | 3.05       | 1.33s/it
712  | 0.9951 | 1.8473729869164517e-05 | 298.90 | 794.77  | 1.99      | 792.60        | 3.47       | 1.27s/it
713  | 0.5847 | 1.8468821286001768e-05 | 369.24 | 219.56  | 3.59      | 215.75        | 2.69       | 1.08s/it
714  | 0.7031 | 1.846390547664831e-05  | 337.32 | 883.72  | 2.39      | 881.17        | 3.02       | 1.22s/it
715  | 0.552  | 1.8458982445298656e-05 | 268.52 | 1019.30 | 2.37      | 1016.79       | 2.45       | 1.25s/it

[Capture ends mid-record: the forward pass of step 716 begins with a truncated tensor print.]
-0.7695, 2.1406, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7656, -1.4844, 1.1719, 1.3438, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.2344, -1.6484, 1.4375, 0.8008, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7656, -3.2344, -1.3125, 1.8281, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7188, -2.9219, 0.6406, -0.7188, -3.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:03:08,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:03:08,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.39 | bwd_microstep: 1072.23 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1071.04 | step_microstep: 2.61 [2025-11-06 18:03:08,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.82 | bwd: 1074.41 | bwd_inner: 3.04 | bwd_allreduce: 1071.13 | step: 2.82 20%|██ | 716/3507 [18:22<1:01:08, 1.31s/it] {'loss': 0.4031, 'learning_rate': 1.8454052196153483e-05, 'epoch': 0.2} 20%|██ | 716/3507 [18:22<1:01:08, 1.31s/it]tensor([[-2.6406, -0.6133, 2.9062, 0.4160, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0312, -2.3438, -0.3906, 2.2812, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9297, -1.4141, 0.1729, 3.4531, -1.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1250, -1.6875, 1.1094, 0.3691, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:03:08,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.49 | bwd_microstep: 0.93 | 
bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-2.7500, -2.0781, -0.2070, 2.3281, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0625, -1.9297, 1.7422, -1.5703, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.8906, 0.0708, 3.2188, 0.1309, -1.5859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4531, -2.4375, 0.1816, 1.8359, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:03:09,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:03:09,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 148.98 | bwd_microstep: 1305.31 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1304.25 | step_microstep: 1.88 [2025-11-06 18:03:09,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.50 | bwd: 1306.24 | bwd_inner: 1.82 | bwd_allreduce: 1304.30 | step: 1.95 20%|██ | 717/3507 [18:23<1:05:56, 1.42s/it] {'loss': 0.1505, 'learning_rate': 1.8449114733419626e-05, 'epoch': 0.2} 20%|██ | 717/3507 [18:23<1:05:56, 1.42s/it]tensor([[-4.1250, -2.0625, 1.5469, -1.2734, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4062, -2.1562, 0.4551, 0.8555, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:03:10,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.26 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.9062, -0.9609, 2.0312, -0.6172, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0625, -2.0625, 1.6641, -0.4648, -3.3125]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.4688, -2.7969, 0.7773, 0.5859, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8750, -2.4062, 0.9023, 1.3906, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2656, -2.2188, 0.3184, 1.8125, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.1875, -6.6875, -2.4375, -1.0859, -6.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:03:11,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.15 | optimizer_step: 0.19 [2025-11-06 18:03:11,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.10 | bwd_microstep: 1173.57 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1172.52 | step_microstep: 1.98 [2025-11-06 18:03:11,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.39 | bwd: 1174.40 | bwd_inner: 1.71 | bwd_allreduce: 1172.56 | step: 2.07 20%|██ | 718/3507 [18:25<1:07:02, 1.44s/it] {'loss': 0.5539, 'learning_rate': 1.844417006131007e-05, 'epoch': 0.2} 20%|██ | 718/3507 [18:25<1:07:02, 1.44s/it]tensor([[-4.2188, -1.9766, 2.1719, -0.3652, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.3398, 0.2246, 1.5000, 3.9531, 0.2217]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.8359, 1.6172, 2.9062, 4.2812, 1.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:03:11,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 206.28 | bwd_microstep: 1.50 | bwd_inner_microstep: 1.28 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.17 tensor([[-1.8828, -0.2949, 2.0469, -0.1875, -1.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') tensor([[-4.0000, -2.5156, 0.8047, 1.0938, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5938, -4.8438, -1.0625, -1.7969, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0781, -1.7578, 1.0547, 1.6016, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.2812, -1.8047, 1.1953, 0.7344, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:03:11,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.40 | optimizer_step: 0.36 [2025-11-06 18:03:11,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 216.36 | bwd_microstep: 58.98 | bwd_inner_microstep: 1.44 | bwd_allreduce_microstep: 57.36 | step_microstep: 3.64 [2025-11-06 18:03:11,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 422.71 | bwd: 60.47 | bwd_inner: 2.74 | bwd_allreduce: 57.43 | step: 3.82 21%|██ | 719/3507 [18:25<54:23, 1.17s/it] {'loss': 0.3538, 'learning_rate': 1.8439218184043953e-05, 'epoch': 0.21} 21%|██ | 719/3507 [18:25<54:23, 1.17s/it]tensor([[-2.9219, -1.1406, 1.8984, 0.0684, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:03:12,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.39 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.0938, -2.0469, 0.3398, 1.4219, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -2.9375, 1.3906, -1.6875, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-8.0000, -6.9688, -3.5000, -0.9922, -6.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4688, -3.0312, 
0.4219, 1.3750, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4688, -2.6719, 1.2344, 0.5547, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0000, -2.0781, 1.4844, -0.2266, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3594, -1.2500, 2.5156, 0.0089, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:03:14,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.88 | optimizer_gradients: 0.19 | optimizer_step: 0.22 [2025-11-06 18:03:14,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.86 | bwd_microstep: 1823.03 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1821.97 | step_microstep: 2.92 [2025-11-06 18:03:14,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.25 | bwd: 1824.08 | bwd_inner: 1.92 | bwd_allreduce: 1822.02 | step: 3.01 21%|██ | 720/3507 [18:28<1:17:47, 1.67s/it] {'loss': 0.2225, 'learning_rate': 1.8434259105846574e-05, 'epoch': 0.21} 21%|██ | 720/3507 [18:28<1:17:47, 1.67s/it]tensor([[1.8359, 2.4062, 3.2812, 5.4375, 2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9844, -0.9922, 2.3125, -0.9062, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2188, -3.3125, -0.7930, 1.6016, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:03:15,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.03 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 tensor([[-4.6875, -3.5781, -0.7031, 0.8789, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8125, -2.5625, 0.3750, 1.4219, -2.8281]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5625, -2.5156, 0.1738, 1.8594, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.9531, -2.4219, -0.6914, 2.2344, -1.9609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4531, -2.8125, -0.7031, 2.3281, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:03:18,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:03:18,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.69 | bwd_microstep: 2950.55 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 2949.46 | step_microstep: 2.36 [2025-11-06 18:03:18,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.73 | bwd: 2951.62 | bwd_inner: 1.93 | bwd_allreduce: 2949.52 | step: 2.49 21%|██ | 721/3507 [18:32<1:41:25, 2.18s/it] {'loss': 0.5009, 'learning_rate': 1.842929283094935e-05, 'epoch': 0.21} 21%|██ | 721/3507 [18:32<1:41:25, 2.18s/it]tensor([[-2.8906, -1.8750, 0.6836, 2.6250, -1.9453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.6328, -1.1953, -0.0104, 2.7500, -0.7930]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -2.4531, 0.7891, 0.7500, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:03:18,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.98 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.0469, -1.9844, 0.4043, 1.7500, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.9453, -1.6172, -0.6016, 2.1875, -1.1094]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([4], device='cuda:1') tensor([[-3.3125, -2.4062, -0.1465, 1.1719, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8125, -2.8438, 1.2656, 0.2080, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4219, -2.6406, -0.3906, 2.2188, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:03:18,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:03:18,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.02 | bwd_microstep: 273.64 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 272.62 | step_microstep: 1.42 [2025-11-06 18:03:18,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.02 | bwd: 274.45 | bwd_inner: 1.67 | bwd_allreduce: 272.66 | step: 1.49 21%|██ | 722/3507 [18:32<1:19:37, 1.72s/it] {'loss': 0.8411, 'learning_rate': 1.842431936358987e-05, 'epoch': 0.21} 21%|██ | 722/3507 [18:32<1:19:37, 1.72s/it]tensor([[-3.5000, -1.8984, 1.2734, 0.5742, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3594, -1.9531, 0.9453, 0.9883, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0000, -1.2422, 1.9531, 0.7188, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-8.1875, -6.1562, -1.2734, -2.1406, -6.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:03:19,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.64 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.06 tensor([[-3.4062, -1.0859, 2.7969, -0.3672, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') 
tensor([[-3.5469, -2.3594, 0.4043, 1.7734, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4531, -1.6562, 1.7422, 0.0869, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.1875, -2.3438, 1.5391, 1.0703, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:03:19,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:03:19,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.11 | bwd_microstep: 65.30 | bwd_inner_microstep: 1.40 | bwd_allreduce_microstep: 63.82 | step_microstep: 1.47 [2025-11-06 18:03:19,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 404.77 | bwd: 66.28 | bwd_inner: 2.30 | bwd_allreduce: 63.85 | step: 1.54 21%|██ | 723/3507 [18:33<1:02:49, 1.35s/it] {'loss': 0.4171, 'learning_rate': 1.841933870801183e-05, 'epoch': 0.21} 21%|██ | 723/3507 [18:33<1:02:49, 1.35s/it]tensor([[-3.2812, -2.2500, 0.2676, 1.7500, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:03:19,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.29 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.2188, -3.9219, -0.5938, 0.5977, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9219, -2.5469, 0.6094, 1.2734, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6875, -2.6406, 0.0928, 2.3438, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9688, -5.0938, -2.3125, 0.5781, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-3.0781, -1.9297, 0.8008, 2.2969, 
-2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5312, -2.7188, 1.2344, 0.3965, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8750, -2.2969, 1.0781, 1.0078, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:03:20,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 18:03:20,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.18 | bwd_microstep: 413.02 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 411.80 | step_microstep: 2.02 [2025-11-06 18:03:20,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.49 | bwd: 413.99 | bwd_inner: 2.02 | bwd_allreduce: 411.84 | step: 2.10 21%|██ | 724/3507 [18:33<54:46, 1.18s/it] {'loss': 0.9374, 'learning_rate': 1.841435086846508e-05, 'epoch': 0.21} 21%|██ | 724/3507 [18:33<54:46, 1.18s/it]tensor([[-3.5312, -2.0938, 0.8828, 1.1719, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4062, -2.4844, -0.0913, 1.8203, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.9375, -1.3047, 0.3984, 2.9844, -1.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6406, -0.5703, 2.8906, 0.4492, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:03:20,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.03 | bwd_microstep: 1.45 | bwd_inner_microstep: 1.30 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-4.4375, -3.2656, -0.3164, 0.9961, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -2.5000, 0.9961, 0.4980, -3.2812]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8125, -2.8594, 1.3516, 0.1631, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8281, -1.5312, 2.5156, -0.8359, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:03:20,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.30 | optimizer_step: 0.28 [2025-11-06 18:03:20,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.67 | bwd_microstep: 104.35 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 103.45 | step_microstep: 3.60 [2025-11-06 18:03:20,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 460.74 | bwd: 105.81 | bwd_inner: 2.12 | bwd_allreduce: 103.51 | step: 3.69 21%|██ | 725/3507 [18:34<46:57, 1.01s/it] {'loss': 0.3606, 'learning_rate': 1.8409355849205597e-05, 'epoch': 0.21} 21%|██ | 725/3507 [18:34<46:57, 1.01s/it]tensor([[-4.0625, -3.0938, -0.4102, 2.1250, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9688, -2.2500, 1.2578, 0.4199, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9688, -2.8906, -0.1455, 1.5391, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3281, -2.5625, -0.4160, 1.7109, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6875, -3.1719, -1.2266, 2.0000, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:03:21,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.54 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-4.1250, -3.3750, -1.0469, 2.0625, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') 
tensor([[-2.0625, -0.0884, 3.2500, 0.7461, -1.6484]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2344, -1.3281, 1.8516, -0.5938, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:03:22,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.25 | optimizer_step: 0.40 [2025-11-06 18:03:22,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.45 | bwd_microstep: 781.02 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 779.63 | step_microstep: 2.99 [2025-11-06 18:03:22,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.00 | bwd: 782.07 | bwd_inner: 2.18 | bwd_allreduce: 779.69 | step: 3.09 21%|██ | 726/3507 [18:36<59:00, 1.27s/it] {'loss': 0.7672, 'learning_rate': 1.8404353654495478e-05, 'epoch': 0.21} 21%|██ | 726/3507 [18:36<59:00, 1.27s/it]tensor([[-3.8438, -3.3125, -1.5781, 1.2344, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2812, -2.0312, 2.0000, -0.6602, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7969, -2.3750, 0.7148, 0.6758, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:03:22,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.73 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.5781, -1.3047, 1.2188, 1.3359, -1.8828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.3320, 1.2812, 2.6562, 3.5469, 0.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2656, -0.4043, 2.5469, 0.3887, -1.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8750, -2.9531, 0.9805, -0.2148, 
-3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2500, -3.8906, -0.2988, 1.0781, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:03:24,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.33 | optimizer_step: 0.42 [2025-11-06 18:03:24,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 252.95 | bwd_microstep: 1708.19 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 1707.02 | step_microstep: 3.53 [2025-11-06 18:03:24,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 427.70 | bwd: 1709.14 | bwd_inner: 1.89 | bwd_allreduce: 1707.09 | step: 3.62 21%|██ | 727/3507 [18:38<1:11:36, 1.55s/it] {'loss': 0.5339, 'learning_rate': 1.839934428860294e-05, 'epoch': 0.21} 21%|██ | 727/3507 [18:38<1:11:36, 1.55s/it]tensor([[-2.7344, -0.4961, 2.8594, -0.9492, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.7500, -1.9844, 1.3281, -0.1543, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9062, -1.9688, 0.1768, 1.3359, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6250, -1.0234, 1.8906, 1.0547, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:03:25,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.82 | bwd_microstep: 1.53 | bwd_inner_microstep: 1.35 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12 tensor([[-4.9375, -3.2656, 0.6016, 0.5547, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.3516, 1.0703, 3.1875, 2.2344, -0.0928]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3750, -3.6562, -1.3281, 1.4531, -3.1562]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5469, -2.9375, -0.9453, 1.9375, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:03:25,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.25 | optimizer_step: 0.29 [2025-11-06 18:03:25,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.09 | bwd_microstep: 3.19 | bwd_inner_microstep: 1.57 | bwd_allreduce_microstep: 1.45 | step_microstep: 2.20 [2025-11-06 18:03:25,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 373.91 | bwd: 4.70 | bwd_inner: 2.96 | bwd_allreduce: 1.50 | step: 2.33 21%|██ | 728/3507 [18:39<55:56, 1.21s/it] {'loss': 1.2039, 'learning_rate': 1.8394327755802334e-05, 'epoch': 0.21} 21%|██ | 728/3507 [18:39<55:56, 1.21s/it]tensor([[-4.0000, -2.3125, 1.2188, 0.1128, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2031, -1.6016, 1.4375, 0.9297, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2188, -3.2812, -0.7109, 1.4375, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3906, -2.1875, 0.6055, 1.7109, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2500, -4.8438, -1.0859, -0.0215, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2188, -3.3438, -0.8867, 1.1719, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:03:25,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.06 | bwd_microstep: 2.19 | bwd_inner_microstep: 1.91 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.19 tensor([[-4.0625, -2.3750, 1.3047, 1.0312, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
DeepSpeed training log excerpt (rank 0), steps 729–749 of 3507, epoch 0.21. The first step's timing breakdown is kept below; the per-step timing lines that follow repeat the same pattern, and the interleaved per-microbatch debug prints (bfloat16 logit tensors and integer label tensors on cuda:0–cuda:3) carry no further information and are elided.

[2025-11-06 18:03:27,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:03:27,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 325.49 | bwd_microstep: 1660.78 | bwd_inner_microstep: 2.49 | bwd_allreduce_microstep: 1658.17 | step_microstep: 7.18
[2025-11-06 18:03:27,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 455.57 | bwd: 1662.96 | bwd_inner: 4.43 | bwd_allreduce: 1658.26 | step: 7.37

21%|██ | 729/3507 [18:41<1:17:19, 1.67s/it] {'loss': 0.2981, 'learning_rate': 1.838930406037411e-05, 'epoch': 0.21}
21%|██ | 730/3507 [18:42<1:01:13, 1.32s/it] {'loss': 0.4807, 'learning_rate': 1.838427320660484e-05, 'epoch': 0.21}
21%|██ | 731/3507 [18:43<1:03:41, 1.38s/it] {'loss': 0.3856, 'learning_rate': 1.83792351987872e-05, 'epoch': 0.21}
21%|██ | 732/3507 [18:44<55:33, 1.20s/it] {'loss': 0.8711, 'learning_rate': 1.8374190041219964e-05, 'epoch': 0.21}
21%|██ | 733/3507 [18:45<52:42, 1.14s/it] {'loss': 0.5428, 'learning_rate': 1.836913773820802e-05, 'epoch': 0.21}
21%|██ | 734/3507 [18:47<1:00:01, 1.30s/it] {'loss': 0.8307, 'learning_rate': 1.8364078294062347e-05, 'epoch': 0.21}
21%|██ | 735/3507 [18:48<1:05:43, 1.42s/it] {'loss': 0.6967, 'learning_rate': 1.835901171310001e-05, 'epoch': 0.21}
21%|██ | 736/3507 [18:49<52:05, 1.13s/it] {'loss': 0.1484, 'learning_rate': 1.8353937999644183e-05, 'epoch': 0.21}
21%|██ | 737/3507 [18:50<52:46, 1.14s/it] {'loss': 0.4369, 'learning_rate': 1.8348857158024102e-05, 'epoch': 0.21}
21%|██ | 738/3507 [18:51<44:22, 1.04it/s] {'loss': 0.9841, 'learning_rate': 1.8343769192575096e-05, 'epoch': 0.21}
21%|██ | 739/3507 [18:53<1:01:53, 1.34s/it] {'loss': 0.6188, 'learning_rate': 1.833867410763858e-05, 'epoch': 0.21}
21%|██ | 740/3507 [18:53<51:00, 1.11s/it] {'loss': 1.0972, 'learning_rate': 1.8333571907562034e-05, 'epoch': 0.21}
21%|██ | 741/3507 [18:55<57:33, 1.25s/it] {'loss': 0.1863, 'learning_rate': 1.832846259669901e-05, 'epoch': 0.21}
21%|██ | 742/3507 [18:56<47:40, 1.03s/it] {'loss': 0.2886, 'learning_rate': 1.832334617940913e-05, 'epoch': 0.21}
21%|██ | 743/3507 [18:57<53:28, 1.16s/it] {'loss': 0.4259, 'learning_rate': 1.8318222660058082e-05, 'epoch': 0.21}
21%|██ | 744/3507 [18:58<56:16, 1.22s/it] {'loss': 0.8101, 'learning_rate': 1.8313092043017606e-05, 'epoch': 0.21}
21%|██ | 745/3507 [19:00<1:00:24, 1.31s/it] {'loss': 0.9282, 'learning_rate': 1.830795433266551e-05, 'epoch': 0.21}
21%|██▏ | 746/3507 [19:01<52:06, 1.13s/it] {'loss': 0.1477, 'learning_rate': 1.8302809533385644e-05, 'epoch': 0.21}
21%|██▏ | 747/3507 [19:03<1:03:09, 1.37s/it] {'loss': 0.8251, 'learning_rate': 1.8297657649567912e-05, 'epoch': 0.21}
21%|██▏ | 748/3507 [19:04<58:58, 1.28s/it] {'loss': 0.1666, 'learning_rate': 1.8292498685608257e-05, 'epoch': 0.21}
21%|██▏ | 749/3507 [19:04<51:06, 1.11s/it] {'loss': 0.9816, 'learning_rate': 1.828733264590867e-05, 'epoch': 0.21}
| 749/3507 [19:04<51:06, 1.11s/it]tensor([[-3.3125, -0.9336, 2.6406, -1.7266, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:03:51,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.14 | bwd_microstep: 2.55 | bwd_inner_microstep: 2.25 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.18 tensor([[-4.7500, -3.0469, 0.8242, 1.2422, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.3594, -0.7188, 0.8789, 3.5312, -0.5078]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.4219, -0.3145, 2.6406, -0.3867, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5781, -1.8281, 1.3984, 0.3164, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9062, -1.4375, 2.7031, -1.2031, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.4844, -0.4004, 2.4844, -1.5469, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1250, -2.1719, 0.8477, -1.6328, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') [2025-11-06 18:03:53,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.37 | optimizer_step: 0.35 [2025-11-06 18:03:53,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.78 | bwd_microstep: 1810.49 | bwd_inner_microstep: 1.70 | bwd_allreduce_microstep: 1808.62 | step_microstep: 3.71 [2025-11-06 18:03:53,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.96 | bwd: 1813.03 | bwd_inner: 3.98 | bwd_allreduce: 1808.75 | step: 3.89 21%|██▏ | 750/3507 [19:06<1:05:31, 1.43s/it] {'loss': 1.185, 'learning_rate': 1.8282159534877183e-05, 'epoch': 0.21} 21%|██▏ | 750/3507 [19:06<1:05:31, 
1.43s/it]tensor([[-3.0625, -0.8125, 2.5938, -0.6602, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8750, -3.1094, 0.6523, 0.1924, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.9844, -2.3750, -0.7109, 1.5625, -1.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:03:53,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.24 | bwd_microstep: 3.22 | bwd_inner_microstep: 2.94 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.20 tensor([[-3.5156, -2.1875, 0.7109, 1.3359, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9062, -3.7656, -0.9180, 0.2334, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -3.9688, -1.3906, 0.7734, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2500, -3.2188, -0.5664, 1.4531, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.7344, -0.5703, 2.6250, -0.2500, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:03:54,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.33 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:03:54,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.04 | bwd_microstep: 516.58 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 515.59 | step_microstep: 3.64 [2025-11-06 18:03:54,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.28 | bwd: 519.79 | bwd_inner: 3.89 | bwd_allreduce: 515.68 | step: 3.84 21%|██▏ | 751/3507 [19:07<58:21, 1.27s/it] {'loss': 0.3735, 'learning_rate': 1.8276979356927853e-05, 'epoch': 0.21} 21%|██▏ | 751/3507 [19:07<58:21, 1.27s/it]tensor([[-4.1250, -2.7031, 0.4219, 
0.8750, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:03:54,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.58 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-2.2969, -1.5391, 0.2715, 2.3906, -1.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3750, -1.8438, 1.0391, 0.4180, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.4766, -0.4121, 1.7266, 2.7031, -0.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.9375e+00, -9.7656e-01, 2.3125e+00, -2.6245e-03, -2.3594e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.1406, -0.2080, 2.3906, -0.3301, -1.7891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.1562, -3.0469, -0.2969, 1.2578, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9531, -2.9062, -0.1982, 1.2812, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:03:56,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.22 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:03:56,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.32 | bwd_microstep: 1864.43 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 1863.45 | step_microstep: 3.31 [2025-11-06 18:03:56,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.93 | bwd: 1865.37 | bwd_inner: 1.76 | bwd_allreduce: 1863.49 | step: 3.38 21%|██▏ | 752/3507 [19:10<1:11:40, 1.56s/it] {'loss': 0.7466, 'learning_rate': 1.8271792116480767e-05, 'epoch': 0.21} 21%|██▏ | 752/3507 [19:10<1:11:40, 1.56s/it]tensor([[2.4219, 2.7031, 2.8594, 5.1250, 2.6250]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7188, -3.1094, -1.1953, 1.3906, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0078, 0.8945, 3.0938, -0.2324, -0.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.2812, -2.5938, 1.1172, 1.2031, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:03:56,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.07 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-2.7344, -1.1484, 1.5312, 0.0620, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -3.5156, -1.2656, 1.4609, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.4375, -1.5156, 1.4766, -0.5352, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3594, -0.1836, 2.6250, -1.0234, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 18:03:56,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.85 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:03:56,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.68 | bwd_microstep: 46.72 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 45.60 | step_microstep: 2.67 [2025-11-06 18:03:56,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.77 | bwd: 47.66 | bwd_inner: 1.89 | bwd_allreduce: 45.63 | step: 2.74 21%|██▏ | 753/3507 [19:10<56:14, 1.23s/it] {'loss': 0.8626, 'learning_rate': 1.8266597817962042e-05, 'epoch': 0.21} 21%|██▏ | 753/3507 [19:10<56:14, 1.23s/it]tensor([[-3.8125, -1.5703, 2.0781, -1.0781, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:0') [2025-11-06 18:03:56,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.08 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.4219, -2.4375, 0.0182, 1.6641, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9062, -2.2656, -0.3281, 2.6562, -1.7578]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3906, -2.1406, 0.3496, 0.9297, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6094, -2.3594, 0.4824, 1.8516, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3750, -3.5312, 0.6211, 0.4570, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9062, -3.8906, -0.1035, -1.5625, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6094, -2.1719, 0.8672, 1.6719, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:03:59,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:03:59,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.10 | bwd_microstep: 2655.08 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 2653.96 | step_microstep: 2.37 [2025-11-06 18:03:59,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.21 | bwd: 2655.88 | bwd_inner: 1.75 | bwd_allreduce: 2654.00 | step: 2.44 21%|██▏ | 754/3507 [19:13<1:20:33, 1.76s/it] {'loss': 0.7125, 'learning_rate': 1.826139646580382e-05, 'epoch': 0.21} 21%|██▏ | 754/3507 [19:13<1:20:33, 1.76s/it]tensor([[-4.4062, -2.9844, 0.3848, 1.3828, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-3.9688, -3.1719, -0.8008, 2.1719, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4062, -3.4688, -0.8867, 1.1406, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9062, -1.9766, 1.3359, -0.3848, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:03:59,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.88 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.4062, -2.4219, 1.0391, -0.8594, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.4688, -0.7266, 1.9453, 0.1289, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.5938, -0.0503, 2.6406, 2.6562, -0.9492]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6250, -2.4375, 1.8047, -0.1260, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:04:00,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.14 | optimizer_step: 0.14 [2025-11-06 18:04:00,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 122.03 | bwd_microstep: 59.73 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 58.69 | step_microstep: 1.38 [2025-11-06 18:04:00,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.94 | bwd: 60.59 | bwd_inner: 1.73 | bwd_allreduce: 58.73 | step: 1.45 22%|██▏ | 755/3507 [19:13<1:02:27, 1.36s/it] {'loss': 0.2592, 'learning_rate': 1.825618806444426e-05, 'epoch': 0.22} 22%|██▏ | 755/3507 [19:13<1:02:27, 1.36s/it]tensor([[-1.8906, -1.4141, -0.0488, 2.6406, -0.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -3.5938, -1.5938, 1.1953, 
-2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4531, -1.8516, 0.0544, 3.5000, -1.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:00,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.86 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5000, -2.5000, -0.0432, 2.3125, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9531, -2.6719, 0.1934, 1.0000, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -3.2188, -0.7891, 1.9219, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6562, -3.8594, -1.4453, 0.9727, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5781, -2.5938, 0.0645, 2.4531, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:04:02,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.30 [2025-11-06 18:04:02,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 221.07 | bwd_microstep: 2203.91 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 2202.68 | step_microstep: 2.08 [2025-11-06 18:04:02,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.96 | bwd: 2204.80 | bwd_inner: 1.91 | bwd_allreduce: 2202.73 | step: 2.16 22%|██▏ | 756/3507 [19:16<1:19:59, 1.74s/it] {'loss': 0.1321, 'learning_rate': 1.8250972618327528e-05, 'epoch': 0.22} 22%|██▏ | 756/3507 [19:16<1:19:59, 1.74s/it]tensor([[-3.5312, -1.2969, 2.2812, -1.0469, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.9062, -2.4688, -1.0000, 1.6172, -1.8125]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:02,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.85 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-2.5312, -1.6953, 0.4375, 2.8281, -1.5078]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.9062, -4.1562, 0.0305, 0.2832, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.1562, 0.8828, 3.3281, 0.6953, -0.9180]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.9688, -3.3281, -0.2148, -0.9297, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6562, -2.2500, 2.0781, -0.9492, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7500, -0.5938, 2.7969, -0.3379, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:04:03,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:04:03,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.30 | bwd_microstep: 115.40 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 114.26 | step_microstep: 1.67 [2025-11-06 18:04:03,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.18 | bwd: 116.29 | bwd_inner: 1.87 | bwd_allreduce: 114.29 | step: 1.74 22%|██▏ | 757/3507 [19:17<1:02:09, 1.36s/it] {'loss': 0.6017, 'learning_rate': 1.8245750131903813e-05, 'epoch': 0.22} 22%|██▏ | 757/3507 [19:17<1:02:09, 1.36s/it]tensor([[-3.4688, -2.0938, 0.6172, 0.6680, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2500, -3.0781, 1.2891, -0.0427, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:0') [2025-11-06 18:04:03,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.96 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.4375, -5.2188, -1.6484, 0.6172, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1875, -2.2500, 1.4844, 0.5156, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5000, -1.8281, 1.2578, 0.4727, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.1562, -1.4688, 0.2314, 2.2656, -1.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6406, -1.9688, 1.1094, 0.1030, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9375, -3.0312, -0.6367, 1.3047, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:04:04,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.21 [2025-11-06 18:04:04,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.31 | bwd_microstep: 1165.58 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1164.43 | step_microstep: 1.99 [2025-11-06 18:04:04,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.29 | bwd: 1166.51 | bwd_inner: 1.92 | bwd_allreduce: 1164.48 | step: 2.06 22%|██▏ | 758/3507 [19:18<1:04:46, 1.41s/it] {'loss': 0.4521, 'learning_rate': 1.82405206096293e-05, 'epoch': 0.22} 22%|██▏ | 758/3507 [19:18<1:04:46, 1.41s/it]tensor([[-5.9688, -4.7188, -1.6094, -0.6133, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -3.7031, -1.9375, 1.5938, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') 
[2025-11-06 18:04:04,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.01 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.1250, -2.3750, 0.9141, -0.0630, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7188, -2.8594, 1.0234, 0.7031, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.4688, -6.1250, -2.3125, -0.2207, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -3.0000, -0.0767, 0.6523, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5781, -2.3750, 0.4355, 1.8828, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5000, -1.0469, 2.9531, -0.7812, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:04:05,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:04:05,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.67 | bwd_microstep: 1.89 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.76 | step_microstep: 1.97 [2025-11-06 18:04:05,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 401.70 | bwd: 2.75 | bwd_inner: 1.83 | bwd_allreduce: 0.80 | step: 2.06 22%|██▏ | 759/3507 [19:19<51:23, 1.12s/it] {'loss': 0.491, 'learning_rate': 1.8235284055966192e-05, 'epoch': 0.22} 22%|██▏ | 759/3507 [19:19<51:23, 1.12s/it]tensor([[-7.0000, -6.2812, -3.5000, 0.0747, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -3.3125, 0.1826, 1.2031, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:04:05,437] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.45 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.8125, -1.6250, 2.1094, -0.5273, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5469, -2.6406, -0.2500, 1.7188, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.6094, -0.2412, 2.9062, -1.1562, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1250, -3.9844, -0.9727, 1.1094, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.3438, -1.5312, 0.4961, 2.3906, -1.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9688, -4.7188, -1.4219, 0.0145, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:04:07,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.72 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:04:07,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.12 | bwd_microstep: 1671.27 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1670.17 | step_microstep: 2.44 [2025-11-06 18:04:07,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 456.59 | bwd: 1672.16 | bwd_inner: 1.81 | bwd_allreduce: 1670.21 | step: 2.52 22%|██▏ | 760/3507 [19:21<1:05:45, 1.44s/it] {'loss': 0.527, 'learning_rate': 1.8230040475382672e-05, 'epoch': 0.22} 22%|██▏ | 760/3507 [19:21<1:05:45, 1.44s/it]tensor([[-5.0938, -3.3438, 0.3574, 0.4609, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:04:07,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.58 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 
0.08 tensor([[-3.6719, -2.8438, -0.5508, 1.8047, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9219, -2.0469, 1.4766, -0.1050, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4688, -2.5156, -0.0771, 2.1562, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0938, -1.5703, 2.6406, -1.1875, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.6094, -0.7695, 1.9453, 0.1846, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4688, -1.2578, 2.1250, -0.5898, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9531, -1.6484, 2.4062, -0.6406, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:04:08,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:04:08,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 67.29 | bwd_microstep: 455.77 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 454.80 | step_microstep: 1.71 [2025-11-06 18:04:08,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 190.88 | bwd: 456.84 | bwd_inner: 1.88 | bwd_allreduce: 454.83 | step: 1.79 22%|██▏ | 761/3507 [19:21<55:17, 1.21s/it] {'loss': 0.2097, 'learning_rate': 1.822478987235294e-05, 'epoch': 0.22} 22%|██▏ | 761/3507 [19:21<55:17, 1.21s/it]tensor([[-3.5000, -2.0312, 1.0469, 1.5859, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3906, -0.2383, 2.4688, -0.4473, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:04:08,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.35 | 
bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.4062, -2.7188, 1.0000, 1.2266, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5781, -2.5000, -0.0693, 1.6172, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5312, -2.2500, 1.1484, -1.9062, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.7969, -1.4141, 1.5078, 2.4219, -1.8516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1250, -3.4531, -1.2344, 1.7422, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.2422, 0.3164, 3.1250, 3.0781, -0.6328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:04:08,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:04:08,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.35 | bwd_microstep: 553.73 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 552.72 | step_microstep: 1.89 [2025-11-06 18:04:08,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 286.73 | bwd: 554.57 | bwd_inner: 1.68 | bwd_allreduce: 552.77 | step: 1.98 22%|██▏ | 762/3507 [19:22<50:39, 1.11s/it] {'loss': 0.8889, 'learning_rate': 1.821953225135717e-05, 'epoch': 0.22} 22%|██▏ | 762/3507 [19:22<50:39, 1.11s/it]tensor([[-3.5000, -1.8203, 1.4219, 1.4375, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:04:09,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.11 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.9219, -3.0312, -0.6602, 1.5781, -2.6875]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.3594, -1.9062, -0.3203, 2.9844, -1.2578]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2500, -3.3906, 0.5547, -0.0613, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.8516, -1.2969, 0.3262, 3.3438, -0.8711]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3438, -1.0859, 2.4062, -0.5781, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8906, -2.3594, 0.6367, 0.6016, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0000, -4.3438, -1.9766, 1.1562, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:04:09,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:04:09,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.60 | bwd_microstep: 20.34 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 19.26 | step_microstep: 1.54 [2025-11-06 18:04:09,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.74 | bwd: 21.22 | bwd_inner: 1.80 | bwd_allreduce: 19.30 | step: 1.63 22%|██▏ | 763/3507 [19:23<41:50, 1.09it/s] {'loss': 0.2892, 'learning_rate': 1.8214267616881535e-05, 'epoch': 0.22} 22%|██▏ | 763/3507 [19:23<41:50, 1.09it/s]tensor([[-3.5781, -1.4297, 2.1094, -0.5430, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7812, -1.3125, 1.2812, 0.7539, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0938, -3.0156, -0.2109, 1.8203, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:09,623] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 187.00 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.6875, -0.9062, 1.4375, -0.3867, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.4375, -2.6406, 1.0469, 0.6289, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.7188, -1.8047, 1.6641, -0.1885, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7812, -1.8906, 1.2500, -0.3379, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.8750, -3.4375, -1.8203, 1.2031, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') [2025-11-06 18:04:12,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 18:04:12,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.81 | bwd_microstep: 2605.23 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 2603.86 | step_microstep: 1.85 [2025-11-06 18:04:12,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.84 | bwd: 2606.07 | bwd_inner: 2.03 | bwd_allreduce: 2603.91 | step: 1.93 22%|██▏ | 764/3507 [19:26<1:10:06, 1.53s/it] {'loss': 1.4106, 'learning_rate': 1.8208995973418192e-05, 'epoch': 0.22} 22%|██▏ | 764/3507 [19:26<1:10:06, 1.53s/it]tensor([[-5.3438, -3.0469, 1.3594, -0.9609, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2656, -1.3125, 2.3750, 1.2344, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:04:12,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.22 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-3.6719, 
-1.5938, 1.6172, -0.9141, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.6797, -0.0095, 1.8984, -0.1572, -1.3516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.2188, -3.1406, 1.1172, 0.0315, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.3281, 0.5898, 2.9844, 0.9805, -0.9922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.3125, -4.5000, -0.3340, -0.2168, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0625, -2.8281, 0.0972, 1.3750, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:04:12,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:04:12,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.18 | bwd_microstep: 104.74 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 103.53 | step_microstep: 1.51 [2025-11-06 18:04:12,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.43 | bwd: 105.75 | bwd_inner: 2.03 | bwd_allreduce: 103.57 | step: 1.61 22%|██▏ | 765/3507 [19:26<55:56, 1.22s/it] {'loss': 0.8243, 'learning_rate': 1.820371732546527e-05, 'epoch': 0.22} 22%|██▏ | 765/3507 [19:26<55:56, 1.22s/it]tensor([[-3.5000, -2.8125, -0.6406, 2.4688, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:13,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.20 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.0000, -3.9688, -1.0781, 1.4844, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3594, -2.4531, -0.1592, 1.9531, -2.2188]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9844, -0.9961, 2.0156, -0.5625, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8594, -2.6250, 0.2578, 1.6406, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4375, -2.5156, 1.2656, 0.2871, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2812, -2.7812, 1.3516, -2.0000, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0938, -2.7031, 0.2451, 0.8828, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:04:15,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:04:15,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.43 | bwd_microstep: 1833.44 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 1832.31 | step_microstep: 1.94 [2025-11-06 18:04:15,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.61 | bwd: 1834.42 | bwd_inner: 1.95 | bwd_allreduce: 1832.34 | step: 2.02 22%|██▏ | 766/3507 [19:28<1:09:39, 1.52s/it] {'loss': 0.5102, 'learning_rate': 1.819843167752689e-05, 'epoch': 0.22} 22%|██▏ | 766/3507 [19:28<1:09:39, 1.52s/it]tensor([[-3.5156, -1.2422, 2.1719, -1.1953, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1875, -3.1719, -0.5664, 1.2812, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:15,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.73 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.6406, -1.5234, 0.9922, 2.3906, -1.7188]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8125, -4.1875, -0.4102, 0.2227, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3438, -2.5000, -0.2500, 2.4844, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.0312, -1.5312, 1.2188, 0.3574, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9297, -0.1357, 2.4531, 0.7852, -1.4453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5625, -0.7539, 2.0312, -0.1475, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:04:15,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:04:15,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.53 | bwd_microstep: 1.88 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.80 | step_microstep: 1.48 [2025-11-06 18:04:15,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.28 | bwd: 2.91 | bwd_inner: 1.95 | bwd_allreduce: 0.83 | step: 1.56 22%|██▏ | 767/3507 [19:29<55:17, 1.21s/it] {'loss': 0.3443, 'learning_rate': 1.8193139034113124e-05, 'epoch': 0.22} 22%|██▏ | 767/3507 [19:29<55:17, 1.21s/it]tensor([[-6.2188, -4.7188, -1.4844, -1.4453, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5312, -2.9688, 0.2598, 0.5000, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1875, -2.4062, -0.2412, 2.3125, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:15,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.53 | bwd_microstep: 1.12 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 
tensor([[-2.5312, -0.8555, 1.6562, -0.1641, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6250, -2.0781, 0.7734, 0.3008, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5312, -1.0391, 2.8906, -0.8945, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -4.4375, -1.9688, 0.8555, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0938, -0.8516, 2.2969, -0.1533, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:04:17,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:04:17,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 83.60 | bwd_microstep: 1924.41 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1923.14 | step_microstep: 1.96 [2025-11-06 18:04:17,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 288.14 | bwd: 1925.52 | bwd_inner: 2.21 | bwd_allreduce: 1923.18 | step: 2.04 22%|██▏ | 768/3507 [19:31<1:09:25, 1.52s/it] {'loss': 0.314, 'learning_rate': 1.8187839399740034e-05, 'epoch': 0.22} 22%|██▏ | 768/3507 [19:31<1:09:25, 1.52s/it]tensor([[-3.2031, -2.1562, 0.3203, 2.3281, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6875, -4.4375, -1.1484, 0.6602, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5938, -3.0938, -1.2500, 2.0625, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-3.1094, -2.0156, 0.2432, 1.1172, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9062, -2.7500, 1.3281, -0.1030, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:0') [2025-11-06 18:04:18,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.16 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.9062, -3.7969, -0.1719, -2.0625, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.3438, -1.7500, 0.1260, 3.4062, -1.2266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2969, -2.4062, -0.0918, 1.7969, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:18,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:04:18,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.67 | bwd_microstep: 1.93 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.37 [2025-11-06 18:04:18,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 495.87 | bwd: 2.83 | bwd_inner: 1.80 | bwd_allreduce: 0.89 | step: 2.44 22%|██▏ | 769/3507 [19:32<55:59, 1.23s/it] {'loss': 1.4253, 'learning_rate': 1.8182532778929637e-05, 'epoch': 0.22} 22%|██▏ | 769/3507 [19:32<55:59, 1.23s/it]tensor([[-4.1875, -3.0781, -0.2373, 1.7266, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.0469, 0.1099, 2.6406, -1.2656, -1.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:04:18,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.17 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.4219, -0.5977, 2.0938, 0.5508, -1.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5156, -2.5000, -0.1416, 1.4844, -2.3906]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.9062, -4.8125, -1.0234, -3.3125, -5.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.2812, -3.0781, -0.1533, 1.3828, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4531, -2.9531, -1.4844, 1.0078, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7812, -2.1719, -0.3301, 2.7656, -1.6328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:04:20,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:04:20,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.43 | bwd_microstep: 2175.09 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 2174.05 | step_microstep: 1.95 [2025-11-06 18:04:20,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.63 | bwd: 2175.88 | bwd_inner: 1.65 | bwd_allreduce: 2174.09 | step: 2.02 22%|██▏ | 770/3507 [19:34<1:13:53, 1.62s/it] {'loss': 1.1566, 'learning_rate': 1.8177219176209915e-05, 'epoch': 0.22} 22%|██▏ | 770/3507 [19:34<1:13:53, 1.62s/it]tensor([[-3.6875, -2.3750, 0.5742, 1.8438, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8438, -1.2266, 0.4121, 3.2969, -0.8633]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7500, -2.3438, 0.5820, 0.9062, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8125, -1.8828, 1.8438, 1.0781, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:04:21,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.66 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 
0.03 | step_microstep: 0.08 tensor([[-2.9219, -0.8555, 2.3281, 0.0615, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1875, -2.7969, -1.2422, 1.9531, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-2.1562, -0.6641, 1.9219, 2.0156, -1.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.3984, 1.1172, 2.4375, 4.6250, 0.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:21,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.14 | optimizer_step: 0.21 [2025-11-06 18:04:21,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.03 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.91 | step_microstep: 1.66 [2025-11-06 18:04:21,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 494.71 | bwd: 3.03 | bwd_inner: 1.95 | bwd_allreduce: 0.95 | step: 1.74 22%|██▏ | 771/3507 [19:35<59:05, 1.30s/it] {'loss': 0.8369, 'learning_rate': 1.8171898596114804e-05, 'epoch': 0.22} 22%|██▏ | 771/3507 [19:35<59:05, 1.30s/it]tensor([[-3.0781, -1.3438, 1.8828, 1.3594, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6562, -3.5156, -0.5508, 1.2266, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9375, -3.3594, -1.3828, 1.8281, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:21,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.12 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.9688, -3.6250, -0.3984, 0.9180, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8125, -3.0938, 
-0.9688, 2.1250, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.8906, -1.1797, 1.7891, 1.3984, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2188, -2.7969, 0.1152, 0.0903, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0312, -1.7109, 2.3906, 0.0505, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:04:22,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:04:22,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.52 | bwd_microstep: 491.22 | bwd_inner_microstep: 1.51 | bwd_allreduce_microstep: 489.61 | step_microstep: 1.66 [2025-11-06 18:04:22,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.67 | bwd: 492.12 | bwd_inner: 2.32 | bwd_allreduce: 489.66 | step: 1.75 22%|██▏ | 772/3507 [19:36<53:05, 1.16s/it] {'loss': 0.3735, 'learning_rate': 1.8166571043184193e-05, 'epoch': 0.22} 22%|██▏ | 772/3507 [19:36<53:05, 1.16s/it]tensor([[-4.4062, -2.9375, 0.1621, 0.9141, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6406, -2.3438, 0.3477, 1.0703, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4219, -2.2500, 0.2617, 1.0625, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9062, -3.7969, -0.9180, 0.9766, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:22,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.24 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06 tensor([[-5.0312, -3.2656, 0.5977, 0.8984, -3.7344]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2812, -3.3438, 0.6016, -0.1465, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4688, -2.2969, 1.5859, -0.3945, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -3.7188, -0.9727, 1.5469, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:23,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.13 | optimizer_step: 0.17 [2025-11-06 18:04:23,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.78 | bwd_microstep: 2.04 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.91 | step_microstep: 1.89 [2025-11-06 18:04:23,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.04 | bwd: 2.80 | bwd_inner: 1.75 | bwd_allreduce: 0.94 | step: 1.96 22%|██▏ | 773/3507 [19:37<57:17, 1.26s/it] {'loss': 0.3326, 'learning_rate': 1.8161236521963928e-05, 'epoch': 0.22} 22%|██▏ | 773/3507 [19:37<57:17, 1.26s/it]tensor([[-4.5938, -2.4688, 1.4297, -0.1299, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -2.5938, 1.5391, -0.3594, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -3.0625, 0.5938, 0.9258, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5000, -1.1406, 2.5938, -0.6328, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:04:24,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 266.09 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.9922, -0.4238, 2.1250, 1.8438, -1.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:1') tensor([[-3.7031, -1.7188, 1.6172, 0.2490, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6250, -3.4219, 0.6016, -1.4219, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2344, -2.2969, 0.0447, 2.2500, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:24,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.21 | optimizer_step: 0.22 [2025-11-06 18:04:24,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.49 | bwd_microstep: 2.29 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 0.99 | step_microstep: 1.97 [2025-11-06 18:04:24,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 409.61 | bwd: 3.11 | bwd_inner: 1.92 | bwd_allreduce: 1.03 | step: 2.05 22%|██▏ | 774/3507 [19:38<46:16, 1.02s/it] {'loss': 0.8754, 'learning_rate': 1.815589503700579e-05, 'epoch': 0.22} 22%|██▏ | 774/3507 [19:38<46:16, 1.02s/it]tensor([[-5.1875, -3.8125, -0.4336, 0.8516, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8281, -2.8906, -0.4512, 1.5312, -2.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3750, -1.6172, 1.4297, 0.4121, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0938, -2.6875, -1.1641, 1.7812, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.2031, -0.9219, 1.7422, 3.1250, -1.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:24,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.34 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.3906, -1.7969, 1.1562, 
1.2812, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.5312, -0.6523, 2.1562, 0.8906, -1.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -4.7812, -2.4062, 0.5547, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:04:26,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.20 | optimizer_step: 0.22 [2025-11-06 18:04:26,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.76 | bwd_microstep: 1726.24 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1725.18 | step_microstep: 2.23 [2025-11-06 18:04:26,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.10 | bwd: 1727.07 | bwd_inner: 1.69 | bwd_allreduce: 1725.24 | step: 2.32 22%|██▏ | 775/3507 [19:40<1:07:51, 1.49s/it] {'loss': 0.8222, 'learning_rate': 1.8150546592867505e-05, 'epoch': 0.22} 22%|██▏ | 775/3507 [19:40<1:07:51, 1.49s/it]tensor([[-3.2969, -1.3516, 1.8906, -0.1060, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:04:27,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.71 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.7969, -3.0469, -0.9219, 1.6484, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1406, -2.2656, -0.0306, 2.0312, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.9375, 0.1494, 2.6094, -0.3438, -1.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-5.1562, -3.1719, 0.9688, 0.9062, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3438, -2.4219, 1.2109, 0.3242, -3.3125]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0312, -2.6719, -1.2500, 1.7891, -1.8828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5000, -2.0938, 0.7500, 1.1875, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:04:27,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.82 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:04:27,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 56.19 | bwd_microstep: 204.46 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 203.53 | step_microstep: 2.55 [2025-11-06 18:04:27,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 219.92 | bwd: 205.45 | bwd_inner: 1.75 | bwd_allreduce: 203.57 | step: 2.63 22%|██▏ | 776/3507 [19:41<53:44, 1.18s/it] {'loss': 1.0836, 'learning_rate': 1.814519119411275e-05, 'epoch': 0.22} 22%|██▏ | 776/3507 [19:41<53:44, 1.18s/it]tensor([[-4.6562, -4.1250, -2.0625, 1.4375, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5469, -1.5078, 0.8047, 2.4844, -1.5859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:04:27,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.75 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.1562, -2.3281, -0.1631, 2.1250, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8594, -2.4531, -0.9961, 2.2969, -1.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.6250, -0.4043, 2.6875, -0.2676, -2.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7500, -4.4688, -1.1016, 0.3633, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:1') tensor([[-4.7812, -2.5938, 0.7148, -1.7734, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0625, -4.2188, -1.7344, 0.9297, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:04:29,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.16 | optimizer_step: 0.14 [2025-11-06 18:04:29,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.91 | bwd_microstep: 1644.39 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1643.24 | step_microstep: 2.09 [2025-11-06 18:04:29,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.69 | bwd: 1645.28 | bwd_inner: 1.89 | bwd_allreduce: 1643.27 | step: 2.16 22%|██▏ | 777/3507 [19:43<1:04:39, 1.42s/it] {'loss': 0.3328, 'learning_rate': 1.8139828845311118e-05, 'epoch': 0.22} 22%|██▏ | 777/3507 [19:43<1:04:39, 1.42s/it]tensor([[-3.1250, -1.3438, 1.6875, 1.1797, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3438, -0.4922, 1.9609, 0.4023, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8594, -1.7422, 1.2969, -1.8516, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:04:29,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.42 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.1875, -2.7500, 0.4785, 1.7422, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6250, -3.0781, -1.2109, 1.8594, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8906, -1.6484, 1.9531, -0.3633, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.9688, 
-0.5664, 2.6406, -1.1562, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1875, -2.9844, -0.1367, 1.3828, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:04:29,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.22 | optimizer_step: 0.21 [2025-11-06 18:04:29,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.82 | bwd_microstep: 43.10 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 41.98 | step_microstep: 2.07 [2025-11-06 18:04:29,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 410.27 | bwd: 44.02 | bwd_inner: 1.87 | bwd_allreduce: 42.02 | step: 2.14 22%|██▏ | 778/3507 [19:43<52:00, 1.14s/it] {'loss': 0.651, 'learning_rate': 1.8134459551038143e-05, 'epoch': 0.22} 22%|██▏ | 778/3507 [19:43<52:00, 1.14s/it]tensor([[-2.5938, -0.7773, 1.4062, -0.4375, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:0') [2025-11-06 18:04:29,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 105.62 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.9375, -1.6953, 1.7578, -0.8281, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.3496, 0.5781, 2.1094, 3.2969, 0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1562, -1.8047, 1.8828, -1.1562, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8750, -2.6562, 1.2734, -0.4883, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5938, -3.0156, 0.4238, 1.5547, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4375, -3.5625, -1.1719, 1.2969, -3.0781]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1250, -3.1562, -0.7344, 1.3438, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:04:31,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.20 | optimizer_step: 0.26 [2025-11-06 18:04:31,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.92 | bwd_microstep: 1775.75 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1774.54 | step_microstep: 2.38 [2025-11-06 18:04:31,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 229.54 | bwd: 1776.76 | bwd_inner: 2.00 | bwd_allreduce: 1774.60 | step: 2.46 22%|██▏ | 779/3507 [19:45<1:04:12, 1.41s/it] {'loss': 0.6948, 'learning_rate': 1.8129083315875282e-05, 'epoch': 0.22} 22%|██▏ | 779/3507 [19:45<1:04:12, 1.41s/it]tensor([[-4.0625, -1.7031, 1.7969, -1.4375, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:04:31,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.53 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.3281, -1.4453, 1.5859, 0.5078, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([[-2.2969, -0.2051, 2.5312, -0.0408, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([2], device='cuda:3') tensor([[-5.2812, -3.7031, -0.3789, -0.1982, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2812, -1.0938, 2.3906, 0.2715, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6562, -1.5703, 1.8047, -0.4434, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1250, -1.1484, 1.8828, -0.1162, -2.5000]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6562, -3.4219, -0.4277, 1.3281, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:04:32,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:04:32,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.84 | bwd_microstep: 117.67 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 116.52 | step_microstep: 2.02 [2025-11-06 18:04:32,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.38 | bwd: 118.71 | bwd_inner: 2.04 | bwd_allreduce: 116.55 | step: 2.09 22%|██▏ | 780/3507 [19:46<50:53, 1.12s/it] {'loss': 0.2324, 'learning_rate': 1.8123700144409916e-05, 'epoch': 0.22} 22%|██▏ | 780/3507 [19:46<50:53, 1.12s/it]tensor([[-5.3750, -3.3906, 0.2344, -0.7969, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9688, -4.4688, -0.9961, -0.1934, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:04:32,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.23 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.0625, 0.0493, 2.5156, -1.1328, -1.8672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8125, -0.8086, 1.8047, -1.0391, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.5625, -2.6719, 0.9922, 0.8398, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5938, -2.1250, 1.4219, -2.7812, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7188, -0.4961, 2.4375, -1.0234, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
[Interleaved per-rank debug tensor prints (logits/label tensors, `grad_fn=<…>` stripped in extraction) and rank-0 DeepSpeed fwd/bwd/step timing breakdowns elided; per-step training metrics retained below.]
22%|██▏ | 781/3507 [19:48<1:04:04, 1.41s/it] {'loss': 0.9345, 'learning_rate': 1.811831004123534e-05, 'epoch': 0.22}
22%|██▏ | 782/3507 [19:48<50:54, 1.12s/it] {'loss': 0.7956, 'learning_rate': 1.811291301095077e-05, 'epoch': 0.22}
22%|██▏ | 783/3507 [19:52<1:26:48, 1.91s/it] {'loss': 0.6897, 'learning_rate': 1.8107509058161328e-05, 'epoch': 0.22}
22%|██▏ | 784/3507 [19:53<1:09:25, 1.53s/it] {'loss': 0.5327, 'learning_rate': 1.8102098187478046e-05, 'epoch': 0.22}
22%|██▏ | 785/3507 [19:54<1:05:49, 1.45s/it] {'loss': 0.2158, 'learning_rate': 1.8096680403517857e-05, 'epoch': 0.22}
22%|██▏ | 786/3507 [19:55<1:04:08, 1.41s/it] {'loss': 0.5156, 'learning_rate': 1.8091255710903593e-05, 'epoch': 0.22}
22%|██▏ | 787/3507 [19:56<59:52, 1.32s/it] {'loss': 0.8839, 'learning_rate': 1.808582411426398e-05, 'epoch': 0.22}
22%|██▏ | 788/3507 [19:58<1:01:11, 1.35s/it] {'loss': 0.4101, 'learning_rate': 1.808038561823364e-05, 'epoch': 0.22}
22%|██▏ | 789/3507 [19:58<48:47, 1.08s/it] {'loss': 0.1215, 'learning_rate': 1.8074940227453074e-05, 'epoch': 0.22}
23%|██▎ | 790/3507 [19:59<52:03, 1.15s/it] {'loss': 0.5002, 'learning_rate': 1.8069487946568675e-05, 'epoch': 0.23}
23%|██▎ | 791/3507 [20:00<43:35, 1.04it/s] {'loss': 0.7568, 'learning_rate': 1.8064028780232702e-05, 'epoch': 0.23}
23%|██▎ | 792/3507 [20:02<1:00:42, 1.34s/it] {'loss': 0.5386, 'learning_rate': 1.805856273310331e-05, 'epoch': 0.23}
23%|██▎ | 793/3507 [20:03<48:00, 1.06s/it] {'loss': 0.7807, 'learning_rate': 1.80530898098445e-05, 'epoch': 0.23}
23%|██▎ | 794/3507 [20:06<1:14:24, 1.65s/it] {'loss': 0.9308, 'learning_rate': 1.804761001512616e-05, 'epoch': 0.23}
23%|██▎ | 795/3507 [20:06<59:45, 1.32s/it] {'loss': 0.2229, 'learning_rate': 1.8042123353624032e-05, 'epoch': 0.23}
23%|██▎ | 796/3507 [20:08<1:05:56, 1.46s/it] {'loss': 0.4825, 'learning_rate': 1.803662983001972e-05, 'epoch': 0.23}
23%|██▎ | 797/3507 [20:08<52:36, 1.16s/it] {'loss': 0.6192, 'learning_rate': 1.8031129449000687e-05, 'epoch': 0.23}
23%|██▎ | 798/3507 [20:09<51:38, 1.14s/it] {'loss': 0.4946, 'learning_rate': 1.8025622215260236e-05, 'epoch': 0.23}
23%|██▎ | 799/3507 [20:10<42:50, 1.05it/s] {'loss': 0.1065, 'learning_rate': 1.8020108133497528e-05, 'epoch': 0.23}
23%|██▎ | 800/3507 [20:12<51:01, 1.13s/it] {'loss': 0.6417, 'learning_rate': 1.801458720841756e-05, 'epoch': 0.23}
23%|██▎ | 801/3507 [20:13<51:06, 1.13s/it] {'loss': 0.5505, 'learning_rate': 1.800905944473117e-05, 'epoch': 0.23}
23%|██▎ | 802/3507 [20:15<1:04:45, 1.44s/it] {'loss': 0.3276,
'learning_rate': 1.8003524847155042e-05, 'epoch': 0.23} 23%|██▎ | 802/3507 [20:15<1:04:45, 1.44s/it]tensor([[-7.0000, -5.9375, -3.0312, -0.5352, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:05:01,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.02 | bwd_microstep: 1.22 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.6719, -1.4453, 1.5234, -0.8789, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4688, -4.0625, -2.3750, 0.8047, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5625, -2.8438, -0.6875, 2.2344, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0469, -1.0859, 1.7344, 0.0197, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6875, -1.6562, 1.7656, 0.7500, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7812, -4.7188, -1.7656, 0.8047, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([[-5.9375, -5.3125, -3.0156, 0.1182, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([3], device='cuda:3') [2025-11-06 18:05:01,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:05:01,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 121.22 | bwd_microstep: 112.20 | bwd_inner_microstep: 1.28 | bwd_allreduce_microstep: 110.84 | step_microstep: 1.55 [2025-11-06 18:05:01,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 245.25 | bwd: 113.42 | bwd_inner: 2.39 | bwd_allreduce: 110.89 | step: 1.67 23%|██▎ | 803/3507 [20:15<50:34, 1.12s/it] {'loss': 0.1385, 'learning_rate': 1.7997983420411674e-05, 
'epoch': 0.23} 23%|██▎ | 803/3507 [20:15<50:34, 1.12s/it]tensor([[-2.9531, -0.5352, 2.3281, -1.3047, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4062, -1.5625, 1.3984, 0.8633, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[1.3438, 2.9844, 4.8125, 3.2188, 1.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.5469, -2.2344, -0.8438, 2.6719, -1.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9688, -1.9141, 0.4980, 2.3594, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5312, -2.1719, -0.9805, 1.7734, -1.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:05:03,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.26 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.0625, -2.8281, -0.0791, 1.6875, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8125, -2.9375, -0.7266, 1.5859, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:05:04,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:05:04,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.40 | bwd_microstep: 1131.41 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 1130.13 | step_microstep: 1.87 [2025-11-06 18:05:04,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.67 | bwd: 1132.52 | bwd_inner: 2.19 | bwd_allreduce: 1130.19 | step: 1.97 23%|██▎ | 804/3507 [20:18<1:10:54, 1.57s/it] {'loss': 0.269, 'learning_rate': 1.7992435169229404e-05, 'epoch': 0.23} 23%|██▎ | 804/3507 
[20:18<1:10:54, 1.57s/it]tensor([[-3.2656, -2.5938, -0.6445, 2.0312, -2.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7031, -2.7500, -0.2949, 2.1094, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.3594, -0.5820, 1.6484, -0.3809, -1.9609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:04,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.28 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.3125, -2.6094, 0.7070, 1.1562, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -2.2188, 0.7852, -0.7656, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1250, -0.9961, 1.8359, -0.2559, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.7656, -0.2949, 2.8750, -1.0000, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3906, -1.6953, 0.9922, 0.1748, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:05:05,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:05:05,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.10 | bwd_microstep: 88.75 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 87.65 | step_microstep: 1.75 [2025-11-06 18:05:05,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 421.40 | bwd: 89.70 | bwd_inner: 1.89 | bwd_allreduce: 87.69 | step: 1.84 23%|██▎ | 805/3507 [20:18<57:03, 1.27s/it] {'loss': 0.2385, 'learning_rate': 1.798688009834238e-05, 'epoch': 0.23} 23%|██▎ | 805/3507 [20:18<57:03, 1.27s/it]tensor([[-2.2500, 
0.0366, 2.5312, -0.8086, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0938, -3.5156, -1.4141, 1.9688, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -1.7578, 1.9375, -0.6758, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.2812, -1.0703, 1.2734, 2.4375, -1.4297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5938, -3.1719, -0.2676, 0.5625, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4375, -0.9219, 1.6875, 1.6875, -1.7266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:06,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.44 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.8438, -2.9844, -1.0000, 0.8398, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8594, -1.4609, 1.3203, -1.8672, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') [2025-11-06 18:05:07,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 18:05:07,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.90 | bwd_microstep: 607.53 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 606.28 | step_microstep: 1.78 [2025-11-06 18:05:07,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.38 | bwd: 608.51 | bwd_inner: 2.02 | bwd_allreduce: 606.33 | step: 1.88 23%|██▎ | 806/3507 [20:20<1:06:16, 1.47s/it] {'loss': 0.6038, 'learning_rate': 1.7981318212490584e-05, 'epoch': 0.23} 23%|██▎ | 806/3507 [20:20<1:06:16, 1.47s/it]tensor([[-5.0938, -3.5156, -0.2285, 0.7734, -3.7656]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0625, -2.0938, 1.3750, 0.5508, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:07,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.93 | bwd_microstep: 1.18 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.1250, 0.4102, 2.9844, -1.3750, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8125, -3.2812, -1.4922, 1.3672, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -4.0625, -2.1250, 1.0312, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9062, -3.2344, 0.1445, 0.6797, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7969, -1.1094, 2.6250, -1.1875, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -2.5469, 1.5000, -0.7031, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:05:07,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:05:07,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.66 | bwd_microstep: 177.89 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 176.65 | step_microstep: 1.79 [2025-11-06 18:05:07,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 426.60 | bwd: 179.07 | bwd_inner: 2.23 | bwd_allreduce: 176.70 | step: 1.88 23%|██▎ | 807/3507 [20:21<55:04, 1.22s/it] {'loss': 0.2047, 'learning_rate': 1.79757495164198e-05, 'epoch': 0.23} 23%|██▎ | 807/3507 [20:21<55:04, 1.22s/it]tensor([[-3.2188, -1.3984, 1.4688, 0.2344, -2.5156]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6562, -2.3906, 1.3359, -0.5352, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2188, -2.4688, -0.4824, 1.8438, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.8984, -1.5703, -0.0457, 3.6562, -0.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:05:07,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.73 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.3750, -2.5000, -0.4961, 1.2109, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.0938, -3.0469, -0.4180, 1.6875, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5781, -1.4688, 1.2500, -0.8516, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6875, -1.7109, 1.3828, 0.6289, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:05:10,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.17 | optimizer_step: 0.21 [2025-11-06 18:05:10,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.75 | bwd_microstep: 2162.43 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 2161.38 | step_microstep: 1.97 [2025-11-06 18:05:10,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.50 | bwd: 2163.28 | bwd_inner: 1.72 | bwd_allreduce: 2161.43 | step: 2.05 23%|██▎ | 808/3507 [20:23<1:12:21, 1.61s/it] {'loss': 0.9045, 'learning_rate': 1.797017401488164e-05, 'epoch': 0.23} 23%|██▎ | 808/3507 [20:23<1:12:21, 1.61s/it]tensor([[-2.5156, -2.1562, -1.1562, 1.1719, -1.4766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], 
device='cuda:2') tensor([[-1.8672, -0.8008, 1.2656, 2.4219, -1.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:05:10,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.40 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.4531, -2.6719, -0.2734, 2.8125, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7656, -2.2344, -0.5039, 2.4844, -1.6172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -4.0625, -1.4297, 1.3828, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8281, -2.3906, 0.4004, 1.0156, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7656, -3.2344, -1.2734, 2.3750, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6250, -3.2344, -0.3672, 0.6797, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:05:10,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:05:10,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.94 | bwd_microstep: 170.88 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 169.87 | step_microstep: 1.67 [2025-11-06 18:05:10,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.36 | bwd: 171.89 | bwd_inner: 1.85 | bwd_allreduce: 169.91 | step: 1.75 23%|██▎ | 809/3507 [20:24<58:26, 1.30s/it] {'loss': 0.6609, 'learning_rate': 1.79645917126335e-05, 'epoch': 0.23} 23%|██▎ | 809/3507 [20:24<58:26, 1.30s/it]tensor([[-2.1719, -0.7422, 1.2188, 0.7383, -1.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5469, -0.8477, 
2.8125, -0.9766, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:10,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.58 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.5000, -2.3906, 1.4453, 0.6523, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -3.1719, -0.7656, 1.2812, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4844, -1.6328, 1.4062, 0.4629, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9375, -0.2432, 2.6250, -2.0781, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.7266, -0.4219, 2.0781, 2.9531, -0.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.8750, -0.9844, 1.8203, 0.6914, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:05:12,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.18 | optimizer_step: 0.16 [2025-11-06 18:05:12,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.29 | bwd_microstep: 896.17 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 895.02 | step_microstep: 1.65 [2025-11-06 18:05:12,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.90 | bwd: 897.06 | bwd_inner: 1.86 | bwd_allreduce: 895.06 | step: 1.73 23%|██▎ | 810/3507 [20:25<58:11, 1.29s/it] {'loss': 0.409, 'learning_rate': 1.7959002614438595e-05, 'epoch': 0.23} 23%|██▎ | 810/3507 [20:25<58:11, 1.29s/it]tensor([[3.7969, 4.1875, 4.5312, 6.2812, 3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5312, -2.0938, 0.6016, 1.6250, -2.4688]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6719, -1.8125, 1.2656, 0.6484, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:12,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.42 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.7500, -3.2500, -1.2812, 2.2500, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.2969, 0.8242, 2.6562, -1.1641, -1.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-2.6250, -0.8906, 1.7969, 1.0078, -1.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9688, -2.0000, 1.1641, 0.2637, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1406, -2.3281, -0.1387, 2.4062, -1.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:05:12,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:05:12,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.45 | bwd_microstep: 90.04 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 89.06 | step_microstep: 1.64 [2025-11-06 18:05:12,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.90 | bwd: 90.90 | bwd_inner: 1.70 | bwd_allreduce: 89.09 | step: 1.71 23%|██▎ | 811/3507 [20:26<46:51, 1.04s/it] {'loss': 0.7516, 'learning_rate': 1.7953406725065942e-05, 'epoch': 0.23} 23%|██▎ | 811/3507 [20:26<46:51, 1.04s/it]tensor([[-2.7500, -1.6094, 0.6641, 1.8438, -1.8516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1250, -2.2500, 1.3203, 1.3984, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:1') tensor([[-4.8125, -4.0000, -1.5781, 0.9648, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:05:12,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.64 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.3750, -1.8359, 0.8828, 1.0625, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-4.0625, -2.7344, -0.0391, 1.1797, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)tensor([2], device='cuda:0') tensor([3], device='cuda:2') tensor([[-4.3125, -2.6094, 0.3008, 0.1484, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -1.5859, 1.7109, -1.4062, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7500, -2.1406, 0.7383, 0.8438, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:05:15,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:05:15,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.00 | bwd_microstep: 2347.23 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 2346.01 | step_microstep: 2.11 [2025-11-06 18:05:15,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.66 | bwd: 2348.01 | bwd_inner: 1.84 | bwd_allreduce: 2346.05 | step: 2.18 23%|██▎ | 812/3507 [20:29<1:09:42, 1.55s/it] {'loss': 0.477, 'learning_rate': 1.794780404929033e-05, 'epoch': 0.23} 23%|██▎ | 812/3507 [20:29<1:09:42, 1.55s/it]tensor([[-0.9922, 1.2734, 3.2500, -0.2100, -0.9805]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:05:15,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.48 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.97 | 
bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.4219, -1.3750, 1.8438, 0.3223, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -3.6406, -1.7500, 1.1641, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-5.7500, -3.1719, 1.2266, -0.9531, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5625, -5.6562, -2.9688, 0.1240, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0312, -3.3438, 0.5898, -2.7969, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5938, -3.6094, 0.3320, 0.4023, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[3.3750, 4.5625, 4.9062, 2.8906, 2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') [2025-11-06 18:05:15,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.22 | optimizer_step: 0.21 [2025-11-06 18:05:15,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.23 | bwd_microstep: 26.62 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 25.57 | step_microstep: 2.02 [2025-11-06 18:05:15,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.73 | bwd: 27.70 | bwd_inner: 1.93 | bwd_allreduce: 25.61 | step: 2.10 23%|██▎ | 813/3507 [20:29<54:24, 1.21s/it] {'loss': 1.0707, 'learning_rate': 1.7942194591892366e-05, 'epoch': 0.23} 23%|██▎ | 813/3507 [20:29<54:24, 1.21s/it]tensor([[-2.2344, -0.2354, 1.9844, -0.9258, -1.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3281, -2.8594, -1.1484, 2.0000, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.4922, -1.3516, -0.2598, 3.4062, -0.4980]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -1.8906, 1.4688, 0.2812, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:15,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.45 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.9219, -2.2188, 0.7734, 0.6875, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8438, -1.5078, 2.1875, -0.2256, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.9062, -3.1250, 0.4629, 0.8711, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7812, -1.3281, 2.2500, -0.7031, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:05:17,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.38 [2025-11-06 18:05:17,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.66 | bwd_microstep: 1553.59 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1552.42 | step_microstep: 2.58 [2025-11-06 18:05:17,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.12 | bwd: 1554.47 | bwd_inner: 1.83 | bwd_allreduce: 1552.49 | step: 2.67 23%|██▎ | 814/3507 [20:31<1:04:13, 1.43s/it] {'loss': 0.567, 'learning_rate': 1.793657835765843e-05, 'epoch': 0.23} 23%|██▎ | 814/3507 [20:31<1:04:13, 1.43s/it]tensor([[-3.7812, -1.5781, 1.7188, -0.1465, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0000, -1.8281, 1.5000, -0.2412, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:17,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 140.23 | 
bwd_microstep: 0.99 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-3.4688, -2.9062, -1.0938, 2.0469, -2.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4688, -3.7812, -1.7188, 0.8906, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4219, -2.8594, -1.1641, 1.5469, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-3.6562, -2.6719, -0.1021, 2.6094, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8281, -0.3398, 2.9062, -0.8047, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6875, -2.6094, -0.3633, 0.9375, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:05:17,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.16 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:05:17,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.59 | bwd_microstep: 143.84 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 142.67 | step_microstep: 3.16 [2025-11-06 18:05:17,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 247.83 | bwd: 144.82 | bwd_inner: 1.94 | bwd_allreduce: 142.73 | step: 3.27 23%|██▎ | 815/3507 [20:31<50:39, 1.13s/it] {'loss': 0.6011, 'learning_rate': 1.793095535138068e-05, 'epoch': 0.23} 23%|██▎ | 815/3507 [20:31<50:39, 1.13s/it]tensor([[-3.4062, -2.3906, -0.0105, 1.9609, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:05:18,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.04 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.5938, -3.0938, -0.0830, 1.1094, -3.3125]], device='cuda:3', 
[Per-rank debug prints of logit tensors (shape [1, 5], dtype=torch.bfloat16, devices cuda:0–cuda:3) and label tensors are interleaved throughout this span; the backward-node name after each `grad_fn=` was lost in extraction. These prints are elided below.]

[2025-11-06 18:05:19,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.23
[2025-11-06 18:05:19,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.88 | bwd_microstep: 1243.23 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 1242.26 | step_microstep: 2.26
[2025-11-06 18:05:19,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 310.94 | bwd: 1244.13 | bwd_inner: 1.69 | bwd_allreduce: 1242.30 | step: 2.33
[Equivalent Rank 0 DeepSpeed timing logs repeat for every subsequent step and are elided.]

23%|██▎ | 816/3507 [20:33<56:56, 1.27s/it] {'loss': 0.258, 'learning_rate': 1.7925325577857062e-05, 'epoch': 0.23}
23%|██▎ | 817/3507 [20:33<46:06, 1.03s/it] {'loss': 0.291, 'learning_rate': 1.7919689041891292e-05, 'epoch': 0.23}
23%|██▎ | 818/3507 [20:35<54:24, 1.21s/it] {'loss': 0.4521, 'learning_rate': 1.7914045748292858e-05, 'epoch': 0.23}
23%|██▎ | 819/3507 [20:36<55:11, 1.23s/it] {'loss': 0.451, 'learning_rate': 1.7908395701877012e-05, 'epoch': 0.23}
23%|██▎ | 820/3507 [20:38<56:16, 1.26s/it] {'loss': 0.5448, 'learning_rate': 1.790273890746477e-05, 'epoch': 0.23}
23%|██▎ | 821/3507 [20:39<1:02:50, 1.40s/it] {'loss': 0.6539, 'learning_rate': 1.7897075369882903e-05, 'epoch': 0.23}
23%|██▎ | 822/3507 [20:40<54:55, 1.23s/it] {'loss': 0.4166, 'learning_rate': 1.789140509396394e-05, 'epoch': 0.23}
23%|██▎ | 823/3507 [20:42<1:08:41, 1.54s/it] {'loss': 0.1795, 'learning_rate': 1.788572808454615e-05, 'epoch': 0.23}
23%|██▎ | 824/3507 [20:43<55:02, 1.23s/it] {'loss': 0.5482, 'learning_rate': 1.788004434647356e-05, 'epoch': 0.23}
24%|██▎ | 825/3507 [20:45<1:06:19, 1.48s/it] {'loss': 0.4229, 'learning_rate': 1.7874353884595935e-05, 'epoch': 0.24}
24%|██▎ | 826/3507 [20:46<53:36, 1.20s/it] {'loss': 0.4879, 'learning_rate': 1.7868656703768773e-05, 'epoch': 0.24}
24%|██▎ | 827/3507 [20:48<1:12:29, 1.62s/it] {'loss': 0.5322, 'learning_rate': 1.7862952808853307e-05, 'epoch': 0.24}
[h264 @ 0xcac9200] mmco: unref short failure
24%|██▎ | 828/3507 [20:49<58:12, 1.30s/it] {'loss': 0.9259, 'learning_rate': 1.7857242204716497e-05, 'epoch': 0.24}
24%|██▎ | 829/3507 [20:50<1:02:09, 1.39s/it] {'loss': 0.2347, 'learning_rate': 1.7851524896231032e-05, 'epoch': 0.24}
24%|██▎ | 830/3507 [20:51<49:57, 1.12s/it] {'loss': 1.0086, 'learning_rate': 1.784580088827532e-05, 'epoch': 0.24}
24%|██▎ | 831/3507 [20:52<50:51, 1.14s/it] {'loss': 0.6985, 'learning_rate': 1.784007018573348e-05, 'epoch': 0.24}
24%|██▎ | 832/3507 [20:53<55:13, 1.24s/it] {'loss': 0.4455, 'learning_rate': 1.7834332793495363e-05, 'epoch': 0.24}
24%|██▍ | 833/3507 [20:55<59:54, 1.34s/it] {'loss': 0.8049, 'learning_rate': 1.782858871645649e-05, 'epoch': 0.24}
24%|██▍ | 834/3507 [20:56<56:33, 1.27s/it] {'loss': 0.1666, 'learning_rate': 1.7822837959518133e-05, 'epoch': 0.24}
24%|██▍ | 835/3507 [20:58<1:00:44, 1.36s/it] {'loss': 0.4194, 'learning_rate': 1.7817080527587222e-05, 'epoch': 0.24}
24%|██▍ | 836/3507 [20:58<48:07, 1.08s/it] {'loss': 0.198, 'learning_rate': 1.7811316425576414e-05, 'epoch': 0.24}
[Log continues; further per-rank debug prints truncated here.]
grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:46,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.83 | optimizer_gradients: 0.16 | optimizer_step: 0.21 [2025-11-06 18:05:46,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.41 | bwd_microstep: 1.70 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.48 [2025-11-06 18:05:46,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.22 | bwd: 2.72 | bwd_inner: 1.74 | bwd_allreduce: 0.83 | step: 2.57 24%|██▍ | 837/3507 [20:59<50:06, 1.13s/it] {'loss': 1.2263, 'learning_rate': 1.780554565840403e-05, 'epoch': 0.24} 24%|██▍ | 837/3507 [20:59<50:06, 1.13s/it]tensor([[-4.2188, -1.6719, 1.4375, -2.2188, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.9531, 0.6172, 3.2188, 3.0781, -0.4434]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5156, -1.0156, 2.0938, -1.2969, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:46,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.44 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.7812, -1.7734, 1.3125, -0.1514, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3750, -4.0625, -2.4688, 0.8945, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-3.5156, -2.5625, -0.0801, 2.4219, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.5625, -0.7031, 0.9883, 2.6562, -0.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5938, -4.5938, -1.7500, 0.8906, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 
18:05:47,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.83 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:05:47,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.23 | bwd_microstep: 3.99 | bwd_inner_microstep: 3.03 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.98 [2025-11-06 18:05:47,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 404.73 | bwd: 4.89 | bwd_inner: 3.80 | bwd_allreduce: 0.90 | step: 8.07 24%|██▍ | 838/3507 [21:01<53:34, 1.20s/it] {'loss': 0.6628, 'learning_rate': 1.7799768230994105e-05, 'epoch': 0.24} 24%|██▍ | 838/3507 [21:01<53:34, 1.20s/it]tensor([[-2.0469, 0.1406, 2.4688, -0.9102, -1.8672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0312, -3.3594, -1.3281, 1.3438, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-1.6797, 0.4785, 2.3438, -0.9453, -1.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.2812, -0.9023, 1.7500, 2.9062, -1.3672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:05:47,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.93 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.5156, -2.8281, -0.9531, 1.4688, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-3.9062, -2.7812, -0.1118, 1.6172, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2969, -0.6133, 1.6797, 0.2734, -1.8203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.8125, -1.7500, 1.7188, 0.4121, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:05:47,971] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:05:47,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.46 | bwd_microstep: 64.63 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 63.68 | step_microstep: 1.76 [2025-11-06 18:05:47,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.43 | bwd: 65.53 | bwd_inner: 1.64 | bwd_allreduce: 63.73 | step: 1.85 24%|██▍ | 839/3507 [21:01<44:20, 1.00it/s] {'loss': 1.5724, 'learning_rate': 1.7793984148276342e-05, 'epoch': 0.24} 24%|██▍ | 839/3507 [21:01<44:20, 1.00it/s]tensor([[-1.5938, -0.3887, 1.9688, 2.9219, -0.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -2.6406, 1.0938, 0.2773, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:48,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.42 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.8906, -2.5000, 0.2148, 0.8594, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3438, -3.2500, 0.6445, 0.3594, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7500, -2.4219, -1.0078, 2.2031, -1.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0625, -2.5156, -0.7461, 2.0469, -1.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7344, -3.4219, -1.6953, 2.0312, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0000, -3.7188, -0.7031, 0.8438, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:50,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | 
optimizer_gradients: 0.15 | optimizer_step: 0.22 [2025-11-06 18:05:50,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.13 | bwd_microstep: 2.23 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 0.92 | step_microstep: 1.89 [2025-11-06 18:05:50,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 292.53 | bwd: 3.13 | bwd_inner: 2.02 | bwd_allreduce: 0.96 | step: 1.99 24%|██▍ | 840/3507 [21:04<1:02:55, 1.42s/it] {'loss': 0.5891, 'learning_rate': 1.778819341518612e-05, 'epoch': 0.24} 24%|██▍ | 840/3507 [21:04<1:02:55, 1.42s/it]tensor([[-2.5938, -0.1611, 2.3281, -1.4297, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2500, -1.2188, 2.2031, 1.2578, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.4531, -1.2266, 1.0781, 2.0781, -1.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:50,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.23 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.9844, -2.1562, 1.1562, 0.5234, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -1.6797, 2.0938, -0.3887, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3438, -3.0781, -1.7500, 1.3203, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2188, -0.5977, 2.5938, -1.2266, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.7812, -4.1562, -0.4531, 0.5469, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:05:51,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.22 | optimizer_step: 0.22 
[2025-11-06 18:05:51,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 134.95 | bwd_microstep: 518.86 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 517.71 | step_microstep: 2.12 [2025-11-06 18:05:51,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.21 | bwd: 519.67 | bwd_inner: 1.79 | bwd_allreduce: 517.74 | step: 2.20 24%|██▍ | 841/3507 [21:05<55:34, 1.25s/it] {'loss': 0.442, 'learning_rate': 1.77823960366645e-05, 'epoch': 0.24} 24%|██▍ | 841/3507 [21:05<55:34, 1.25s/it]tensor([[-4.1250, -2.1094, 1.3750, 0.5547, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:51,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.47 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11 tensor([[-3.8594, -2.3125, 0.5938, 0.8945, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.3281, -1.7500, 0.2334, 3.6406, -1.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9219, -1.2344, 2.4375, -1.1641, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2500, -1.1875, 1.5625, -0.5469, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5781, -0.3262, 2.0781, -1.0312, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3906, -1.4141, 0.5703, 1.5078, -1.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9062, -0.4863, 2.2812, -1.5938, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:52,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.20 [2025-11-06 18:05:52,627] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.59 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.92 | step_microstep: 1.97 [2025-11-06 18:05:52,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.05 | bwd: 3.17 | bwd_inner: 2.02 | bwd_allreduce: 0.97 | step: 2.09 24%|██▍ | 842/3507 [21:06<57:31, 1.30s/it] {'loss': 0.2418, 'learning_rate': 1.777659201765821e-05, 'epoch': 0.24} 24%|██▍ | 842/3507 [21:06<57:31, 1.30s/it]tensor([[-4.1562, -3.3125, -0.7500, 1.8906, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0469, -1.6641, 0.7617, 1.1641, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0625, -1.2578, 2.4844, -1.6719, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:52,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.92 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-5.0312, -3.0312, 0.7578, 0.1592, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2500, -1.4844, 0.4941, -1.0156, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:1') tensor([[-2.8281, -2.2500, -0.7422, 1.4375, -1.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-4.6875, -2.5938, 1.2891, 0.6797, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0000, -1.0938, 2.8906, -1.4141, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:05:54,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:05:54,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 162.46 | bwd_microstep: 1279.17 | bwd_inner_microstep: 1.37 | bwd_allreduce_microstep: 1277.70 | step_microstep: 2.14 [2025-11-06 18:05:54,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.38 | bwd: 1280.18 | bwd_inner: 2.25 | bwd_allreduce: 1277.76 | step: 2.25 24%|██▍ | 843/3507 [21:08<1:02:57, 1.42s/it] {'loss': 1.1831, 'learning_rate': 1.7770781363119644e-05, 'epoch': 0.24} 24%|██▍ | 843/3507 [21:08<1:02:57, 1.42s/it]tensor([[-3.5625, -0.9336, 2.2812, -1.8828, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:54,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.82 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.6250, -1.4766, 1.4766, -0.7305, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9688, -2.5000, 0.4707, 1.3594, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.1250, -1.8281, -0.5234, 2.5312, -1.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5469, -1.7656, 1.3281, 0.7617, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6875, -1.5312, 1.8828, 0.4102, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.1328, 0.1826, 2.5938, 3.5625, -0.4492]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1875, -3.7812, 0.5195, -1.1172, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:55,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.84 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:05:55,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.80 | bwd_microstep: 
2.05 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.69 [2025-11-06 18:05:55,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 319.64 | bwd: 2.78 | bwd_inner: 1.74 | bwd_allreduce: 0.91 | step: 2.77 24%|██▍ | 844/3507 [21:09<1:02:26, 1.41s/it] {'loss': 0.2484, 'learning_rate': 1.776496407800686e-05, 'epoch': 0.24} 24%|██▍ | 844/3507 [21:09<1:02:26, 1.41s/it]tensor([[-3.6406, -2.5938, -0.0525, 1.7734, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7344, -2.5312, 0.2080, 1.8828, -2.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:05:55,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.33 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.0156, 0.1582, 2.4062, -0.9492, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-3.3281, -1.7422, 1.2344, 1.6797, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([2], device='cuda:3') tensor([[-1.9141, 0.0703, 2.2031, -0.1543, -1.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2812, -2.8281, 0.0608, 0.9570, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1250, -2.5781, -0.7695, 2.2969, -1.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.9375, -6.1250, -3.5156, -0.8672, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:05:57,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:05:57,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 75.78 | bwd_microstep: 1062.11 | bwd_inner_microstep: 0.89 | 
bwd_allreduce_microstep: 1061.15 | step_microstep: 1.44 [2025-11-06 18:05:57,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 260.13 | bwd: 1063.06 | bwd_inner: 1.74 | bwd_allreduce: 1061.19 | step: 1.53 24%|██▍ | 845/3507 [21:10<1:01:42, 1.39s/it] {'loss': 0.6769, 'learning_rate': 1.7759140167283576e-05, 'epoch': 0.24} 24%|██▍ | 845/3507 [21:10<1:01:42, 1.39s/it]tensor([[-3.3125, -0.7500, 2.1719, -1.6328, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4219, -2.8594, -0.8125, 2.3750, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1094, -2.0000, 0.5234, 2.2812, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6562, -0.4609, 2.2344, -0.2910, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8281, -1.9766, 1.0469, -0.1777, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:57,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.83 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-0.7109, 0.7578, 2.4375, 0.9766, -0.5352]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2500, -1.2031, 0.9961, 2.1719, -1.3672]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -1.7031, 2.1250, -0.9844, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:05:57,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:05:57,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.90 | bwd_microstep: 2.00 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.84 | 
step_microstep: 2.14 [2025-11-06 18:05:57,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 428.76 | bwd: 2.83 | bwd_inner: 1.84 | bwd_allreduce: 0.87 | step: 2.23 24%|██▍ | 846/3507 [21:11<53:55, 1.22s/it] {'loss': 0.3474, 'learning_rate': 1.775330963591916e-05, 'epoch': 0.24} 24%|██▍ | 846/3507 [21:11<53:55, 1.22s/it]tensor([[-2.8281, -2.4219, -0.8125, 2.2656, -1.5703]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3750, -2.7656, -0.5977, 2.5938, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6250, -2.1719, -0.5664, 2.2969, -1.4297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:05:58,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.07 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.7344, -3.3125, -1.5078, 1.6406, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2188, -2.6875, 0.5195, 1.6562, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8750, -2.4375, 0.4688, 1.2656, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -3.3125, 0.0771, 1.4531, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6562, -1.9609, 1.0547, 0.8555, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:05:59,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:05:59,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.07 | bwd_microstep: 1184.89 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1183.80 | step_microstep: 1.84 [2025-11-06 18:05:59,423] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.16 | bwd: 1185.85 | bwd_inner: 1.88 | bwd_allreduce: 1183.84 | step: 1.91 24%|██▍ | 847/3507 [21:13<58:19, 1.32s/it] {'loss': 0.3354, 'learning_rate': 1.7747472488888622e-05, 'epoch': 0.24} 24%|██▍ | 847/3507 [21:13<58:19, 1.32s/it]tensor([[-4.0938, -1.9375, 1.7266, 0.3828, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3750, -2.7344, -0.4941, 2.6406, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:05:59,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.93 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-1.6328, 0.7812, 3.4688, -0.4434, -1.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7188, -2.3125, 0.9375, -1.8594, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4531, -1.7422, 1.1719, 0.9688, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8125, -2.5156, 1.2734, -0.5391, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1250, -3.6094, -0.6953, -0.7266, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -4.1250, -1.7656, 1.8516, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:06:00,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:06:00,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.42 | bwd_microstep: 328.62 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 327.59 | step_microstep: 1.53 [2025-11-06 18:06:00,581] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 257.36 | bwd: 329.67 | bwd_inner: 1.88 | bwd_allreduce: 327.64 | step: 1.62 24%|██▍ | 848/3507 [21:14<56:11, 1.27s/it] {'loss': 0.7199, 'learning_rate': 1.774162873117263e-05, 'epoch': 0.24} 24%|██▍ | 848/3507 [21:14<56:11, 1.27s/it]tensor([[-1.7422, -1.5156, -0.5039, 2.3594, -0.7227]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:06:00,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.27 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.0312, -5.5312, -3.0938, 0.4277, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8750, -1.8438, 1.4375, 0.2041, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8125, -0.4082, 2.4062, -1.1328, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-2.7188, -0.4766, 2.3594, -0.6211, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3438, -4.5625, -1.9766, 0.9062, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6719, -1.9609, 1.2266, 1.2969, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0312, -0.8242, 2.2500, 0.1758, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:06:01,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.28 [2025-11-06 18:06:01,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.94 | bwd_microstep: 652.15 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 650.88 | step_microstep: 2.07 [2025-11-06 18:06:01,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.23 | bwd: 653.13 | 
bwd_inner: 2.02 | bwd_allreduce: 650.93 | step: 2.16 24%|██▍ | 849/3507 [21:15<52:56, 1.20s/it] {'loss': 1.1024, 'learning_rate': 1.7735778367757484e-05, 'epoch': 0.24} 24%|██▍ | 849/3507 [21:15<52:56, 1.20s/it]tensor([[-0.4883, 0.4668, 2.1719, 3.2656, 0.0554]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:06:01,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.61 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.8125, -2.0781, 1.0938, 0.8828, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9844, -2.6094, -1.0391, 1.9062, -1.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4062, -2.8438, 0.3242, 1.1406, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8438, -1.5781, 2.1094, 0.0781, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9062, -3.4062, -1.4609, 1.5625, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.1719, -0.4121, 2.0000, 0.8164, -1.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3438, -1.8828, 0.7812, 1.3594, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:06:03,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:06:03,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.60 | bwd_microstep: 1051.42 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1050.30 | step_microstep: 1.80 [2025-11-06 18:06:03,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 241.22 | bwd: 1052.33 | bwd_inner: 1.84 | bwd_allreduce: 1050.35 | 
 24%|██▍       | 850/3507 [21:16<57:26, 1.30s/it] {'loss': 0.3478, 'learning_rate': 1.7729921403635128e-05, 'epoch': 0.24}
 24%|██▍       | 851/3507 [21:18<54:39, 1.23s/it] {'loss': 0.3427, 'learning_rate': 1.7724057843803127e-05, 'epoch': 0.24}
 24%|██▍       | 852/3507 [21:18<48:14, 1.09s/it] {'loss': 0.5044, 'learning_rate': 1.7718187693264687e-05, 'epoch': 0.24}
 24%|██▍       | 853/3507 [21:21<1:10:18, 1.59s/it] {'loss': 0.4734, 'learning_rate': 1.7712310957028626e-05, 'epoch': 0.24}
 24%|██▍       | 854/3507 [21:22<56:33, 1.28s/it] {'loss': 0.1838, 'learning_rate': 1.7706427640109386e-05, 'epoch': 0.24}
 24%|██▍       | 855/3507 [21:22<45:01, 1.02s/it] {'loss': 0.2723, 'learning_rate': 1.770053774752703e-05, 'epoch': 0.24}
 24%|██▍       | 856/3507 [21:24<51:55, 1.18s/it] {'loss': 0.6034, 'learning_rate': 1.769464128430722e-05, 'epoch': 0.24}
 24%|██▍       | 857/3507 [21:25<52:33, 1.19s/it] {'loss': 0.239, 'learning_rate': 1.7688738255481233e-05, 'epoch': 0.24}
 24%|██▍       | 858/3507 [21:27<1:12:10, 1.63s/it] {'loss': 0.9707, 'learning_rate': 1.768282866608595e-05, 'epoch': 0.24}
 24%|██▍       | 859/3507 [21:28<58:50, 1.33s/it] {'loss': 0.7888, 'learning_rate': 1.767691252116384e-05, 'epoch': 0.24}
 25%|██▍       | 860/3507 [21:30<1:02:31, 1.42s/it] {'loss': 0.3989, 'learning_rate': 1.7670989825762975e-05, 'epoch': 0.25}
 25%|██▍       | 861/3507 [21:31<1:00:40, 1.38s/it] {'loss': 1.216, 'learning_rate': 1.7665060584937018e-05, 'epoch': 0.25}
 25%|██▍       | 862/3507 [21:32<50:39, 1.15s/it] {'loss': 0.8524, 'learning_rate': 1.76591248037452e-05, 'epoch': 0.25}
 25%|██▍       | 863/3507 [21:35<1:19:29, 1.80s/it] {'loss': 0.6805, 'learning_rate': 1.7653182487252355e-05, 'epoch': 0.25}
 25%|██▍       | 864/3507 [21:35<1:02:14, 1.41s/it] {'loss': 0.6284, 'learning_rate': 1.764723364052888e-05, 'epoch': 0.25}
 25%|██▍       | 865/3507 [21:37<1:02:18, 1.41s/it] {'loss': 0.3566, 'learning_rate': 1.7641278268650743e-05, 'epoch': 0.25}
 25%|██▍       | 866/3507 [21:37<50:16, 1.14s/it] {'loss': 0.9513, 'learning_rate': 1.763531637669949e-05, 'epoch': 0.25}
 25%|██▍       | 867/3507 [21:40<1:14:01, 1.68s/it] {'loss': 0.8322, 'learning_rate': 1.762934796976222e-05, 'epoch': 0.25}
 25%|██▍       | 868/3507 [21:41<57:53, 1.32s/it] {'loss': 0.6082, 'learning_rate': 1.7623373052931598e-05, 'epoch': 0.25}
 25%|██▍       | 869/3507 [21:42<58:52, 1.34s/it] {'loss': 0.5133, 'learning_rate': 1.7617391631305843e-05, 'epoch': 0.25}
 25%|██▍       | 870/3507 [21:43<50:22, 1.15s/it] {'loss': 0.3288, 'learning_rate': 1.7611403709988716e-05, 'epoch': 0.25}
-3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4688, -3.4375, -0.2012, -1.8281, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0938, -2.3750, 1.0156, 0.9141, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7344, -1.9297, 1.0078, -0.1406, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.9375, -2.8906, 0.2227, -2.0156, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 18:06:30,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.14 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:06:30,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.47 | bwd_microstep: 121.31 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 120.10 | step_microstep: 3.04 [2025-11-06 18:06:30,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.93 | bwd: 122.55 | bwd_inner: 2.24 | bwd_allreduce: 120.16 | step: 3.15 25%|██▍ | 871/3507 [21:43<42:24, 1.04it/s] {'loss': 1.0296, 'learning_rate': 1.760540929408953e-05, 'epoch': 0.25} 25%|██▍ | 871/3507 [21:43<42:24, 1.04it/s]tensor([[-3.7031, -1.1797, 2.4844, -0.3789, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4375, -3.3594, -0.7188, 0.6992, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7969, -0.7734, 2.0469, -0.2441, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:06:30,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.21 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10 tensor([[-3.8125, -1.9453, 1.0312, -0.7695, -3.0312]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2500, -1.7422, 1.0938, 1.1484, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0312, -1.0625, 2.1719, 0.9297, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1875, -3.8594, -0.3164, 1.3203, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6875, -3.0625, -0.7695, 2.1406, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:06:32,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.78 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:06:32,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 267.28 | bwd_microstep: 724.46 | bwd_inner_microstep: 2.05 | bwd_allreduce_microstep: 722.29 | step_microstep: 4.57 [2025-11-06 18:06:32,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 400.51 | bwd: 725.53 | bwd_inner: 3.03 | bwd_allreduce: 722.33 | step: 4.66 25%|██▍ | 872/3507 [21:46<57:29, 1.31s/it] {'loss': 0.4813, 'learning_rate': 1.759940838872315e-05, 'epoch': 0.25} 25%|██▍ | 872/3507 [21:46<57:29, 1.31s/it]tensor([[-5.0938, -3.8438, -0.7656, 0.4824, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0312, -1.4688, 2.2656, -0.6445, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6250, -4.0938, -0.6484, 0.1748, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:06:32,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.90 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.4375, -2.6094, -0.3867, 1.5938, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:1') tensor([[-4.2188, -2.5312, 0.8945, 1.1562, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.7188, -4.3125, -0.7227, -3.2188, -5.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8125, -1.6172, 0.9219, 2.0156, -1.7734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9062, 0.2637, 3.3594, 1.2266, -1.4766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:06:32,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.14 | optimizer_step: 0.18 [2025-11-06 18:06:32,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.25 | bwd_microstep: 1.86 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.68 | step_microstep: 1.58 [2025-11-06 18:06:32,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.19 | bwd: 2.68 | bwd_inner: 1.86 | bwd_allreduce: 0.71 | step: 1.65 25%|██▍ | 873/3507 [21:46<45:47, 1.04s/it] {'loss': 0.2971, 'learning_rate': 1.759340099900996e-05, 'epoch': 0.25} 25%|██▍ | 873/3507 [21:46<45:47, 1.04s/it]tensor([[-0.7305, 1.0391, 2.8281, 0.3047, -0.6953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:06:32,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.38 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.9219, -0.5391, 2.0469, -1.3281, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.0625, -2.2188, 1.2188, 0.9727, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.3828, 0.8125, 2.9062, -0.6016, -1.3516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-2.7500, -2.2656, -0.2676, 
3.1875, -1.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3438, -1.9531, 1.9531, -0.6680, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.5000, -1.3438, 1.3203, 3.0156, -1.4141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3125, -3.2344, -0.4004, 1.3516, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:06:35,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.87 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:06:35,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 209.48 | bwd_microstep: 1192.65 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 1191.34 | step_microstep: 2.51 [2025-11-06 18:06:35,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.88 | bwd: 1193.55 | bwd_inner: 2.02 | bwd_allreduce: 1191.39 | step: 2.59 25%|██▍ | 874/3507 [21:48<1:03:33, 1.45s/it] {'loss': 1.0193, 'learning_rate': 1.7587387130075883e-05, 'epoch': 0.25} 25%|██▍ | 874/3507 [21:48<1:03:33, 1.45s/it]tensor([[-4.3750, -2.2656, 1.2734, -0.6016, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3125, -3.5000, -0.9102, 1.5625, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3438, -2.8594, 0.2910, 0.9961, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:06:35,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.21 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.6250, -3.1406, 0.1089, 1.0625, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.5234, 1.5781, 3.7500, 0.5625, -0.5742]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.6250, 0.1289, 2.6094, 1.6406, -1.1172]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1562, -3.2188, 0.3281, -0.4492, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2500, -2.2812, 0.3672, 2.5000, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:06:35,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.20 | optimizer_step: 0.17 [2025-11-06 18:06:35,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.07 | bwd_microstep: 128.50 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 127.35 | step_microstep: 2.24 [2025-11-06 18:06:35,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.31 | bwd: 129.41 | bwd_inner: 1.90 | bwd_allreduce: 127.38 | step: 2.32 25%|██▍ | 875/3507 [21:49<50:41, 1.16s/it] {'loss': 0.3594, 'learning_rate': 1.7581366787052384e-05, 'epoch': 0.25} 25%|██▍ | 875/3507 [21:49<50:41, 1.16s/it]tensor([[-3.7500, -1.0000, 2.2188, -2.1094, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.9531, -2.6562, 0.2852, 1.4766, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9844, -3.5312, -1.5078, 1.4219, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6406, -1.0938, 2.6875, -0.2500, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:06:35,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.34 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.2031, -0.7070, 2.4844, -0.9180, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:1') tensor([[-1.9453, 0.0601, 2.7812, 0.7969, -1.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.8125, -4.1562, -1.7812, 0.9375, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.8594, 0.3535, 2.2344, -1.0859, -1.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:06:37,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:06:37,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.54 | bwd_microstep: 1972.36 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1971.16 | step_microstep: 2.33 [2025-11-06 18:06:37,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.90 | bwd: 1973.38 | bwd_inner: 2.03 | bwd_allreduce: 1971.21 | step: 2.42 25%|██▍ | 876/3507 [21:51<1:06:57, 1.53s/it] {'loss': 1.395, 'learning_rate': 1.757533997507643e-05, 'epoch': 0.25} 25%|██▍ | 876/3507 [21:51<1:06:57, 1.53s/it]tensor([[-3.4375, -0.8516, 2.2500, -1.5469, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:06:38,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.99 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.1562, -1.2812, 2.6094, -1.7266, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0312, -4.4375, -1.9922, 1.2578, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8125, -2.5000, 0.4199, 1.5234, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0625, -3.0781, 0.8789, 0.7891, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2969, 
0.2314, 3.1094, -0.7773, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.9375, -2.3125, 1.2344, -2.5156, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5625, -1.4453, 1.5156, -0.3066, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:06:38,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:06:38,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.85 | bwd_microstep: 30.78 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 29.47 | step_microstep: 1.81 [2025-11-06 18:06:38,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.87 | bwd: 31.84 | bwd_inner: 2.15 | bwd_allreduce: 29.52 | step: 1.90 25%|██▌ | 877/3507 [21:52<52:16, 1.19s/it] {'loss': 0.5582, 'learning_rate': 1.7569306699290517e-05, 'epoch': 0.25} 25%|██▌ | 877/3507 [21:52<52:16, 1.19s/it]tensor([[-2.1406, -0.3945, 1.8516, 0.3691, -1.6641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6250, -3.2812, -1.6875, 1.0781, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9219, -1.4141, 0.3867, 3.2344, -0.7852]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8125, -1.1484, 2.2344, -1.4531, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4375, -4.0625, -0.8164, 0.4883, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.2002, 1.5156, 3.5156, 3.6875, 0.5547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:06:38,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.27 | bwd_microstep: 1.06 | 
bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.7500, -3.3125, -0.1108, 1.0078, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9688, -1.9062, 1.1641, -0.5508, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:06:40,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.19 | optimizer_step: 0.29 [2025-11-06 18:06:40,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.16 | bwd_microstep: 1721.37 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1720.31 | step_microstep: 2.10 [2025-11-06 18:06:40,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.45 | bwd: 1722.43 | bwd_inner: 1.90 | bwd_allreduce: 1720.37 | step: 2.18 25%|██▌ | 878/3507 [21:54<1:07:35, 1.54s/it] {'loss': 0.2601, 'learning_rate': 1.7563266964842666e-05, 'epoch': 0.25} 25%|██▌ | 878/3507 [21:54<1:07:35, 1.54s/it]tensor([[-2.9219, -2.6875, -1.3203, 1.6641, -1.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3125, -3.6406, -1.1641, 1.8750, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7500, -3.2500, -1.1328, 2.2031, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:06:40,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.55 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-2.7500, -2.3438, -0.5625, 2.6719, -1.4141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7969, -1.8750, 1.3047, -0.1133, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8750, -2.8906, -0.3809, 1.2578, -2.5938]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.3125, -4.7812, -1.0312, -3.9062, -6.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-3.2656, -0.5703, 2.9531, -0.7383, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:06:41,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.36 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 18:06:41,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.98 | bwd_microstep: 161.64 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 160.55 | step_microstep: 3.97 [2025-11-06 18:06:41,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.55 | bwd: 162.70 | bwd_inner: 1.92 | bwd_allreduce: 160.61 | step: 4.09 25%|██▌ | 879/3507 [21:55<54:55, 1.25s/it] {'loss': 0.7582, 'learning_rate': 1.75572207768864e-05, 'epoch': 0.25} 25%|██▌ | 879/3507 [21:55<54:55, 1.25s/it]tensor([[-4.1875, -2.4219, 1.1328, 1.2188, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2812, -2.8750, 0.3906, 1.5703, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2500, -2.9531, -1.2266, 2.3438, -1.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -4.1875, -1.8281, 1.2500, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:06:41,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.91 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.9531, -2.0469, 1.2344, 0.3379, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3906, -1.0703, 1.4922, -1.5000, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], 
device='cuda:0') tensor([[-4.5625, -3.1562, 0.2051, 1.4297, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8906, -1.4219, 2.4062, -0.3047, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:06:42,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.19 | optimizer_step: 0.23 [2025-11-06 18:06:42,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.49 | bwd_microstep: 710.84 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 709.49 | step_microstep: 2.40 [2025-11-06 18:06:42,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.43 | bwd: 711.50 | bwd_inner: 1.82 | bwd_allreduce: 709.52 | step: 2.48 25%|██▌ | 880/3507 [21:56<52:45, 1.21s/it] {'loss': 0.5662, 'learning_rate': 1.7551168140580745e-05, 'epoch': 0.25} 25%|██▌ | 880/3507 [21:56<52:45, 1.21s/it]tensor([[-3.9531, -3.6406, -1.8438, 1.4766, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2500, -3.2500, -0.6797, 1.0547, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:06:42,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.63 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.7891, 0.7422, 3.3906, -1.0234, -1.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.6875, 0.8242, 3.4375, -0.8242, -1.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6719, -2.2812, -0.6055, 2.3125, -1.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-3.8125, -2.9219, -0.4570, 1.4609, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0625, -0.7344, 
1.8906, -1.4062, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.0312, -0.5039, 2.1406, 2.0000, -1.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:06:43,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.75 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:06:43,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.02 | bwd_microstep: 1.85 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.79 | step_microstep: 3.02 [2025-11-06 18:06:43,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.61 | bwd: 2.62 | bwd_inner: 1.69 | bwd_allreduce: 0.81 | step: 3.09 25%|██▌ | 881/3507 [21:57<48:39, 1.11s/it] {'loss': 0.6697, 'learning_rate': 1.7545109061090236e-05, 'epoch': 0.25} 25%|██▌ | 881/3507 [21:57<48:39, 1.11s/it]tensor([[-3.2031, -0.9922, 2.1719, -0.1230, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3125e+00, -3.9375e+00, -8.0859e-01, 5.3406e-04, -3.8906e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3594, -2.7031, -0.3848, 2.5469, -1.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5625, -3.9844, -0.6875, -0.3633, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1875, -3.7188, -1.8203, 1.2344, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:06:43,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.35 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.2812, -1.7891, 1.0391, 1.4609, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.0469, -0.4609, 1.5859, 0.4141, -1.5547]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0625, -2.7656, 0.1299, 1.1562, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:06:45,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.16 | optimizer_step: 0.15 [2025-11-06 18:06:45,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.93 | bwd_microstep: 1538.87 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1537.76 | step_microstep: 1.66 [2025-11-06 18:06:45,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.31 | bwd: 1539.85 | bwd_inner: 1.93 | bwd_allreduce: 1537.81 | step: 1.75 25%|██▌ | 882/3507 [21:59<1:04:49, 1.48s/it] {'loss': 0.3607, 'learning_rate': 1.7539043543584905e-05, 'epoch': 0.25} 25%|██▌ | 882/3507 [21:59<1:04:49, 1.48s/it]tensor([[-1.8203, -1.5469, -0.3828, 2.3906, -0.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:06:45,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.13 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.4844, -2.8438, -0.7344, 2.0938, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.3125, -3.0312, -0.1787, 1.0781, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2812, -4.3750, -1.4453, 1.3828, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.8906, -1.1016, 2.1562, -2.2500, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6250, -2.6719, 0.9062, 0.2441, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [h264 @ 0x915f6c0] mmco: unref short failure tensor([[-3.4375, -0.8477, 2.4531, -1.2422, 
-3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5625, -1.8203, 0.1484, 2.1094, -1.4609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:06:46,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:06:46,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.74 | bwd_microstep: 703.79 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 702.71 | step_microstep: 1.75 [2025-11-06 18:06:46,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.91 | bwd: 704.63 | bwd_inner: 1.74 | bwd_allreduce: 702.74 | step: 1.83 25%|██▌ | 883/3507 [22:00<58:59, 1.35s/it] {'loss': 1.0374, 'learning_rate': 1.753297159324027e-05, 'epoch': 0.25} 25%|██▌ | 883/3507 [22:00<58:59, 1.35s/it]tensor([[-3.9219, -1.5703, 2.0312, 0.1016, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4375, -2.9844, 0.1631, 0.6797, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -1.7812, 1.6250, -0.3945, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8906, -1.2656, 2.1406, -1.3281, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:06:46,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.91 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-5.2812, -3.0938, 1.0938, 0.3965, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.8594, -0.3398, 3.0312, -0.3105, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0000, -0.4258, 2.5469, -1.0781, -2.6406]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6250, -3.3125, -0.2930, 1.2109, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:06:48,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:06:48,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.33 | bwd_microstep: 1770.06 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 1768.69 | step_microstep: 1.56 [2025-11-06 18:06:48,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 476.25 | bwd: 1771.00 | bwd_inner: 2.06 | bwd_allreduce: 1768.75 | step: 1.67 25%|██▌ | 884/3507 [22:02<1:11:18, 1.63s/it] {'loss': 0.212, 'learning_rate': 1.752689321523735e-05, 'epoch': 0.25} 25%|██▌ | 884/3507 [22:02<1:11:18, 1.63s/it]tensor([[-5.0938, -3.9688, -1.1797, 0.5039, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2031, -0.7617, 2.4531, -0.6562, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3281, -2.9844, -1.4375, 1.2656, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:06:49,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.29 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.8750, -3.2500, -0.9062, 2.1875, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3438, -2.4219, 0.6641, -0.6875, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7188, -2.5781, 1.2109, 0.0403, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7969, -2.0312, 1.1641, 0.8672, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
[2025-11-06 18:06:49 – 18:07:16] Training log, steps 885–906 of 3507 (condensed).
Interleaved per-rank debug prints (one 1x5 bf16 logit tensor plus a predicted-label tensor per GPU per microstep; grad_fn reprs truncated in the capture), gradient-accumulation microstep timing lines, and duplicate tqdm progress-bar redraws are omitted. The [Rank 0] DeepSpeed wall-clock totals for each optimizer step are summarized below.

step  loss    learning_rate           epoch  rate      fwd(ms)  bwd(ms)   bwd_allreduce(ms)
 885  0.4259  1.752080841476264e-05   0.25   1.28s/it  365.12     46.38     44.33
 886  0.2545  1.751471719700812e-05   0.25   1.65s/it  424.64   2071.98   2070.01
 887  0.4983  1.7508619567171238e-05  0.25   1.29s/it  390.44      2.92      0.80
 888  0.3051  1.7502515530454924e-05  0.25   1.28s/it  416.60    825.75    823.68
 889  0.2261  1.7496405092067563e-05  0.25   1.11s/it  395.00      2.82      0.90
 890  0.2671  1.7490288257223013e-05  0.25   1.09it/s  426.46      2.83      0.81
 891  0.3861  1.7484165031140582e-05  0.25   1.17s/it  344.07   1370.66   1368.71
 892  0.5333  1.7478035419045038e-05  0.25   1.01it/s  344.76    192.90    190.92
 893  0.6714  1.747189942616659e-05   0.25   1.72s/it  360.60   2947.96   2945.95
 894  0.4656  1.7465757057740905e-05   0.25   1.35s/it  383.78     78.38     76.21
 895  0.573   1.7459608319009074e-05  0.26   1.16s/it  356.73    319.01    316.28
 896  0.9587  1.7453453215217634e-05  0.26   1.13s/it  408.88    604.76    602.79
 897  0.2686  1.744729175161855e-05   0.26   1.01it/s  316.98      3.50      1.09
 898  0.48    1.7441123933469208e-05  0.26   1.20s/it  456.12    703.56    701.16
 899  0.7026  1.743494976603243e-05   0.26   1.74s/it  254.70      2.57      0.85
 900  1.2492  1.7428769254576444e-05  0.26   1.37s/it  447.89      2.48      0.83
 901  0.3201  1.7422582404374893e-05  0.26   1.59s/it  349.32      2.71      0.86
 902  0.5021  1.7416389220706836e-05  0.26   1.32s/it  494.69    133.13    131.29
 903  0.3507  1.7410189708856725e-05  0.26   1.68s/it  313.37      2.58      0.87
 904  0.4243  1.7403983874114422e-05  0.26   1.31s/it  364.98     36.89     34.92
 905  0.2614  1.7397771721775174e-05  0.26   1.36s/it  331.38      2.97      0.88
 906  0.7387  1.7391553257139626e-05  0.26   1.09s/it  344.81     75.26     72.87

[h264 @ 0x98ad880] mmco: unref short failure   (video decode warning, emitted between steps 902 and 903)
1.09s/it]tensor([[-6.1562, -3.6250, 0.9141, -1.0156, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:07:16,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.11 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.0312, -2.2344, 0.0271, 2.2969, -1.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3438, -1.9609, 2.0938, -0.0623, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -1.9375, 1.9844, 0.1738, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0781, -2.3750, -0.0557, 2.6406, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.4570, 1.7656, 3.7188, -0.1562, -0.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6250, -1.8047, 2.6094, -0.5508, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.0625, -0.6680, 2.0469, 2.4219, -1.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:07:17,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:07:17,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.47 | bwd_microstep: 2.31 | bwd_inner_microstep: 1.40 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.56 [2025-11-06 18:07:17,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.58 | bwd: 3.15 | bwd_inner: 2.17 | bwd_allreduce: 0.86 | step: 2.64 26%|██▌ | 907/3507 [22:31<54:15, 1.25s/it] {'loss': 0.1833, 'learning_rate': 1.7385328485513804e-05, 'epoch': 0.26} 26%|██▌ | 907/3507 [22:31<54:15, 1.25s/it]tensor([[-3.4531, -0.8594, 1.6719, -2.3281, 
-3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2500, -2.1562, 0.6055, 2.4688, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0781, -2.1094, 0.5352, 2.8750, -1.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5000, -0.3223, 1.9297, -1.0547, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:07:18,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.33 | bwd_microstep: 1.36 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2500, -3.0469, -0.0286, 1.6406, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5312, -2.7344, -0.1562, 2.8281, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8906, -2.9219, -0.3340, 1.3984, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.4688, 0.0811, 3.3594, 0.0530, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:07:18,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:07:18,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.39 | bwd_microstep: 1.91 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 0.67 | step_microstep: 2.63 [2025-11-06 18:07:18,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.75 | bwd: 3.27 | bwd_inner: 2.44 | bwd_allreduce: 0.70 | step: 2.72 26%|██▌ | 908/3507 [22:32<43:29, 1.00s/it] {'loss': 0.6495, 'learning_rate': 1.737909741220913e-05, 'epoch': 0.26} 26%|██▌ | 908/3507 [22:32<43:29, 1.00s/it]tensor([[-1.8906, -1.5312, 0.2197, 3.5625, -0.6641]], device='cuda:2', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:2') tensor([[0.3965, 2.0469, 4.1875, 3.4062, 0.5742]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:07:18,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.06 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.5781, -1.2891, 2.0938, -0.0669, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9688, -1.7500, 1.9609, 0.3945, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7188, -3.8281, 0.0459, -0.2500, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5312, -2.3594, 1.2812, -0.2988, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3906, -2.7500, -0.6328, 1.9766, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8750, -4.0625, -0.5586, -1.0781, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:07:21,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.39 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 18:07:21,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.99 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.91 | step_microstep: 3.28 [2025-11-06 18:07:21,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.06 | bwd: 2.99 | bwd_inner: 1.91 | bwd_allreduce: 0.94 | step: 3.35 26%|██▌ | 909/3507 [22:35<1:07:42, 1.56s/it] {'loss': 0.521, 'learning_rate': 1.737286004254238e-05, 'epoch': 0.26} 26%|██▌ | 909/3507 [22:35<1:07:42, 1.56s/it]tensor([[-3.1562, -1.3828, 1.1016, -0.1484, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') 
tensor([[-3.4375, -1.2578, 1.7344, -0.2197, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -3.8594, -1.2812, 1.7891, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:07:21,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.28 | bwd_microstep: 1.48 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.13 tensor([[-2.8125, -0.8164, 1.9453, 0.1309, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2812, -2.2969, 0.8789, -0.4238, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8750, -2.2188, 1.6016, -1.5625, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9219, -3.3281, -1.4375, 0.5469, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6562, -3.7031, -1.0938, 0.7695, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:07:21,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.84 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:07:21,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.81 | bwd_microstep: 21.38 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 20.22 | step_microstep: 2.75 [2025-11-06 18:07:21,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.14 | bwd: 22.85 | bwd_inner: 2.40 | bwd_allreduce: 20.28 | step: 2.88 26%|██▌ | 910/3507 [22:35<53:14, 1.23s/it] {'loss': 0.589, 'learning_rate': 1.7366616381835715e-05, 'epoch': 0.26} 26%|██▌ | 910/3507 [22:35<53:14, 1.23s/it]tensor([[-4.2188, -2.4531, 0.8633, 0.4902, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:07:21,869] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.61 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.9688, -2.7812, -1.2031, 2.2969, -1.4766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-3.0781, -2.3281, -0.1582, 1.7188, -1.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5469, -1.5625, 0.3457, 1.3438, -1.5703]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0312, -2.6562, -0.9414, 2.0000, -1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2812, -3.2969, 0.6133, 0.0349, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1094, -1.8359, 0.7656, 1.7422, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5312, -3.6875, -0.9102, 1.5078, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:07:23,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.66 | optimizer_gradients: 0.13 | optimizer_step: 0.20 [2025-11-06 18:07:23,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.69 | bwd_microstep: 1.86 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.58 [2025-11-06 18:07:23,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.31 | bwd: 2.71 | bwd_inner: 1.72 | bwd_allreduce: 0.87 | step: 2.66 26%|██▌ | 911/3507 [22:37<1:06:54, 1.55s/it] {'loss': 0.8738, 'learning_rate': 1.7360366435416668e-05, 'epoch': 0.26} 26%|██▌ | 911/3507 [22:37<1:06:54, 1.55s/it]tensor([[-4.7188, -2.3438, 1.5703, -0.1504, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8438, -1.4922, 1.8984, -0.4199, -3.1094]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5156, -2.8906, -0.4961, 2.6250, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:07:24,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.99 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.4062, 0.0349, 2.4688, -1.2734, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.9688, -2.7656, 0.1924, 1.6953, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7500, -2.7969, 1.1016, 0.8125, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6094, -2.7656, -0.5352, 1.3672, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -1.3594, 2.2188, -2.0156, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:07:24,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:07:24,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.73 | bwd_microstep: 22.53 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 21.52 | step_microstep: 2.28 [2025-11-06 18:07:24,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.75 | bwd: 23.38 | bwd_inner: 1.69 | bwd_allreduce: 21.55 | step: 2.36 26%|██▌ | 912/3507 [22:38<51:34, 1.19s/it] {'loss': 0.5304, 'learning_rate': 1.7354110208618124e-05, 'epoch': 0.26} 26%|██▌ | 912/3507 [22:38<51:34, 1.19s/it]tensor([[4.1250, 5.8750, 6.3750, 3.2344, 3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.0312, -2.1250, 1.2500, 0.4492, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-3.7344, -2.7812, -0.0903, 1.8672, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1875, -1.9375, 1.8828, 0.1748, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:07:24,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.21 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.9688, -1.8438, 1.6250, 0.1235, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3750, -2.7500, -0.4199, 2.5312, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -2.0156, 1.4922, 0.6016, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8594, -0.9922, 1.4453, -0.5977, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:07:25,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:07:25,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.95 | bwd_microstep: 1.89 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.85 | step_microstep: 1.94 [2025-11-06 18:07:25,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.18 | bwd: 2.73 | bwd_inner: 1.72 | bwd_allreduce: 0.89 | step: 2.03 26%|██▌ | 913/3507 [22:39<56:42, 1.31s/it] {'loss': 0.5983, 'learning_rate': 1.7347847706778344e-05, 'epoch': 0.26} 26%|██▌ | 913/3507 [22:39<56:42, 1.31s/it]tensor([[-3.5469, -2.4062, 0.2393, 1.7734, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9531, -1.5000, 1.3516, -1.6094, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:07:26,101] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.81 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.2188, -4.1562, -1.1641, 0.7422, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -3.1875, -0.1260, 2.1250, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9688, -0.7891, 1.9297, -0.3359, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4219, -2.4531, 0.0947, 1.7188, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9375, -2.7344, 1.4062, 0.4141, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3281, -2.7188, -0.3809, 2.8438, -1.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:07:26,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.15 | optimizer_step: 0.20 [2025-11-06 18:07:26,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.43 | bwd_microstep: 72.66 | bwd_inner_microstep: 1.28 | bwd_allreduce_microstep: 71.30 | step_microstep: 1.61 [2025-11-06 18:07:26,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.27 | bwd: 73.54 | bwd_inner: 2.09 | bwd_allreduce: 71.33 | step: 1.68 26%|██▌ | 914/3507 [22:40<45:13, 1.05s/it] {'loss': 0.5008, 'learning_rate': 1.7341578935240922e-05, 'epoch': 0.26} 26%|██▌ | 914/3507 [22:40<45:13, 1.05s/it]tensor([[-3.8594, -3.2500, -1.0312, 1.7422, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3906, -1.0078, 2.9219, 1.2188, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:07:26,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 181.17 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.3281, -2.1250, 0.1641, 0.6602, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8438, -2.9688, 0.8398, 0.6367, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -2.9219, 0.6367, 1.3047, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4688, -2.5781, 1.0781, 0.7305, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3125, -3.2812, 0.8203, 0.4707, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0156, -0.4785, 2.7500, -0.5078, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:07:27,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:07:27,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.39 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.85 | step_microstep: 1.67 [2025-11-06 18:07:27,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.57 | bwd: 3.00 | bwd_inner: 1.97 | bwd_allreduce: 0.89 | step: 1.76 26%|██▌ | 915/3507 [22:40<41:31, 1.04it/s] {'loss': 0.4961, 'learning_rate': 1.7335303899354818e-05, 'epoch': 0.26} 26%|██▌ | 915/3507 [22:40<41:31, 1.04it/s]tensor([[-3.4688, -2.6562, -0.2578, 2.1094, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.4375, -0.0645, 3.4062, 1.1094, -1.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.5312, -0.4531, 1.7422, 2.5000, -0.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-3.7812, -1.4922, 2.1875, 0.4180, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2969, -0.6523, 2.6562, -0.8789, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.1250, 1.2188, 3.2188, -0.4785, -1.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:07:28,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.88 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.1562, -2.9688, 0.0791, 2.0625, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8438, -3.6719, -0.9688, 0.2793, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:07:29,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:07:29,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.64 | bwd_microstep: 400.19 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 399.01 | step_microstep: 1.73 [2025-11-06 18:07:29,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.54 | bwd: 401.20 | bwd_inner: 2.00 | bwd_allreduce: 399.05 | step: 1.82 26%|██▌ | 916/3507 [22:42<54:55, 1.27s/it] {'loss': 1.1088, 'learning_rate': 1.732902260447433e-05, 'epoch': 0.26} 26%|██▌ | 916/3507 [22:42<54:55, 1.27s/it]tensor([[-3.6094, -3.2344, -1.2266, 2.1094, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0547, -0.8398, 0.5938, 4.2812, 0.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4531, -2.5156, -0.0391, 1.9922, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:07:29,333] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.88 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.09 tensor([[-4.0938, -2.8906, 0.1455, 1.9766, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4531, -0.2061, 2.5781, -0.0464, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5781, -2.0781, 0.7969, 1.0938, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4844, -2.4062, 0.1885, 1.7500, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2812, -1.4922, 1.4141, 0.3418, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:07:29,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:07:29,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.48 | bwd_microstep: 1.63 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.61 | step_microstep: 1.34 [2025-11-06 18:07:29,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.39 | bwd: 2.57 | bwd_inner: 1.81 | bwd_allreduce: 0.64 | step: 1.43 26%|██▌ | 917/3507 [22:43<43:48, 1.01s/it] {'loss': 0.3895, 'learning_rate': 1.7322735055959095e-05, 'epoch': 0.26} 26%|██▌ | 917/3507 [22:43<43:48, 1.01s/it]tensor([[-2.2031, -0.4082, 2.2812, 1.4766, -1.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8906, -3.0469, -0.4707, 1.6641, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -1.6328, 2.1094, -0.6328, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[3.3125, 4.6562, 5.9062, 5.7188, 3.0469]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8438, -1.5312, 1.1016, 1.7031, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1875, -2.7812, 0.4531, 1.6641, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7344, -2.3281, 0.5859, 1.2422, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:07:32,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.27 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.7500, -2.3125, 1.6797, -0.4844, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:07:32,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.12 | optimizer_step: 0.14 [2025-11-06 18:07:32,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.45 | bwd_microstep: 1.79 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.72 | step_microstep: 2.27 [2025-11-06 18:07:32,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 464.70 | bwd: 2.78 | bwd_inner: 1.88 | bwd_allreduce: 0.76 | step: 2.37 26%|██▌ | 918/3507 [22:46<1:08:51, 1.60s/it] {'loss': 0.5335, 'learning_rate': 1.7316441259174092e-05, 'epoch': 0.26} 26%|██▌ | 918/3507 [22:46<1:08:51, 1.60s/it]tensor([[-4.2500, -2.5625, 0.6797, 0.8164, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2812, -4.1250, -1.1094, 0.3516, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:07:32,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.15 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.1875, -2.1094, 1.5859, 0.4102, 
-3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2500, -3.7188, -0.1172, 0.8906, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8750, -1.4922, 1.8906, -1.0469, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.5781, -1.8281, 0.1855, 2.3906, -1.3828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2969, -0.1206, 2.2656, -0.6914, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7812, -3.0000, -0.5195, 2.0312, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:07:32,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:07:32,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.64 | bwd_microstep: 134.92 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 133.76 | step_microstep: 1.50 [2025-11-06 18:07:32,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.81 | bwd: 135.75 | bwd_inner: 1.83 | bwd_allreduce: 133.80 | step: 1.57 26%|██▌ | 919/3507 [22:46<54:25, 1.26s/it] {'loss': 0.2679, 'learning_rate': 1.7310141219489633e-05, 'epoch': 0.26} 26%|██▌ | 919/3507 [22:46<54:25, 1.26s/it]tensor([[-5.1250, -2.6875, 0.5312, -2.5938, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.5938, -2.8594, 0.5938, 0.4941, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3906, -1.8359, 1.1641, 1.8047, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2188, -2.6719, 1.3984, -1.3125, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:07:33,186] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.48 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.3750, -0.5391, 2.9375, -1.0234, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7812, -1.5781, 1.6094, -0.2021, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-10.6875, -8.3125, -3.5938, -4.4375, -8.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.4219, -0.2773, 2.5000, 0.4336, -1.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:07:35,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.71 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 18:07:35,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.42 | bwd_microstep: 2.08 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.46
[2025-11-06 18:07:35,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.91 | bwd: 2.88 | bwd_inner: 1.86 | bwd_allreduce: 0.90 | step: 2.53
26%|██▌ | 920/3507 [22:49<1:06:39, 1.55s/it] {'loss': 0.7464, 'learning_rate': 1.7303834942281346e-05, 'epoch': 0.26}
tensor([[-2.6875, -0.7617, 1.6406, -0.1523, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3438, -3.2188, -0.4863, 1.0156, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.1719, -2.3438, 0.0615, 2.5156, -1.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.8125, -3.5156, -1.6562, 1.5625, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:07:35,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.55 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.3887, -0.0308, 1.1250, 3.8594, 0.4824]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-3.5625, -3.6094, -2.1719, 1.9609, -1.8984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.9688, -2.8438, 0.0737, 2.2031, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.8438, -3.9844, 0.0304, 0.2480, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:07:35,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:07:35,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.34 | bwd_microstep: 93.51 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 92.49 | step_microstep: 1.70
[2025-11-06 18:07:35,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.92 | bwd: 94.38 | bwd_inner: 1.72 | bwd_allreduce: 92.53 | step: 1.78
26%|██▋ | 921/3507 [22:49<52:51, 1.23s/it] {'loss': 0.6451, 'learning_rate': 1.7297522432930193e-05, 'epoch': 0.26}
tensor([[-3.0625, -0.8438, 2.1406, -0.4102, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:07:35,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.54 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.4688, -3.1094, -0.0854, -2.8125, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6875, -1.5938, 1.3359, -0.3574, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.8906, -0.6523, 2.2969, -0.0972, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.2656, 0.1533, 3.4844, 1.0234, -1.8203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.4375, -1.1484, 1.4922, 2.8438, -1.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-0.7500, -0.1865, 1.5703, 4.3750, 0.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5312, -2.8281, 0.2773, 0.2559, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:07:36,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:07:36,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.92 | bwd_microstep: 904.21 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 903.24 | step_microstep: 1.71
[2025-11-06 18:07:36,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.49 | bwd: 905.01 | bwd_inner: 1.59 | bwd_allreduce: 903.28 | step: 1.79
26%|██▋ | 922/3507 [22:50<53:47, 1.25s/it] {'loss': 0.2306, 'learning_rate': 1.729120369682244e-05, 'epoch': 0.26}
tensor([[-3.1406, -0.6406, 2.4375, -0.6133, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:07:37,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 73.53 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.6875, -0.4551, 2.3281, -0.4629, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6094, -2.4688, 0.1709, 1.6484, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.2500, -2.8125, -0.6602, 2.7500, -1.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.3594, -2.9219, -1.0625, 1.7266, -1.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.3906, 0.2393, 3.3750, -0.6133, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6875, -1.9297, 0.9688, 0.1611, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2656, -2.6094, -0.2969, 2.3594, -1.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:07:38,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:07:38,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.64 | bwd_microstep: 1141.99 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1140.87 | step_microstep: 127.24
[2025-11-06 18:07:38,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 220.19 | bwd: 1142.90 | bwd_inner: 1.84 | bwd_allreduce: 1140.91 | step: 127.33
26%|██▋ | 923/3507 [22:52<57:13, 1.33s/it] {'loss': 0.1499, 'learning_rate': 1.728487873934969e-05, 'epoch': 0.26}
tensor([[-2.2344, -2.2188, -1.2891, 1.7500, -0.9336]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-3.6094, -2.9219, -0.8008, 1.4219, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-3.0625, -2.5625, -0.4570, 2.6719, -1.6328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.2344, -1.2344, 1.6562, 0.2383, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7500, -2.2969, 1.7031, -0.4961, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.2812, -2.9219, 0.3789, 1.8594, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:07:39,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.72 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-5.2500, -3.7656, -0.5977, 0.1914, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.2500, -3.2188, 0.7188, 0.1504, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:07:40,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.27
[2025-11-06 18:07:40,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.56 | bwd_microstep: 1203.42 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 1202.16 | step_microstep: 2.11
[2025-11-06 18:07:40,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 423.31 | bwd: 1204.40 | bwd_inner: 2.01 | bwd_allreduce: 1202.21 | step: 2.21
26%|██▋ | 924/3507 [22:54<1:05:45, 1.53s/it] {'loss': 1.268, 'learning_rate': 1.7278547565908823e-05, 'epoch': 0.26}
tensor([[-4.2500, -2.8906, 0.3125, 1.5781, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:07:40,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.30 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.3750, -3.3750, 0.4160, -0.0167, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.2344, 0.4199, 3.3750, -0.8242, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.9688, -2.7969, -1.3438, 2.0312, -1.5078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
tensor([[-3.2344, -2.9844, -1.6406, 1.0781, -1.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5312, -2.8281, 1.6641, -0.9180, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6250, -0.8828, 2.7969, -0.8633, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.3125, -4.1250, 0.0137, -0.7188, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:07:44,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 18:07:44,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.17 | bwd_microstep: 3155.44 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 3154.32 | step_microstep: 2.20
[2025-11-06 18:07:44,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.50 | bwd: 3156.46 | bwd_inner: 1.96 | bwd_allreduce: 3154.37 | step: 2.29
26%|██▋ | 925/3507 [22:57<1:32:02, 2.14s/it] {'loss': 0.7911, 'learning_rate': 1.7272210181902044e-05, 'epoch': 0.26}
tensor([[-4.8438, -2.6406, 1.2969, 0.2578, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6875, -0.3418, 2.3750, -0.1895, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-3.7656, -2.1406, 0.9688, 1.0547, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:07:44,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.49 | bwd_microstep: 1.12 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-3.5469, -2.6562, -0.1289, 1.9141, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6094, -1.3438, 1.9922, -0.2656, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.1875, -4.2188, -0.1514, -0.1611, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2656, -1.7344, 1.3359, 1.8672, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.6562, -6.0000, -2.9062, 0.6406, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:07:44,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:07:44,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.28 | bwd_microstep: 1.96 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.77 | step_microstep: 1.54
[2025-11-06 18:07:44,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.80 | bwd: 3.07 | bwd_inner: 2.16 | bwd_allreduce: 0.80 | step: 1.61
26%|██▋ | 926/3507 [22:58<1:09:48, 1.62s/it] {'loss': 0.7405, 'learning_rate': 1.726586659273686e-05, 'epoch': 0.26}
tensor([[-1.6484, 0.6016, 2.7344, -0.7578, -1.6172]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5625, -3.4375, -0.3242, 2.0156, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:07:44,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.78 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.3906, -0.9141, 1.1484, 0.6406, -1.7266]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.0625, -1.9062, 2.0000, 0.9414, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5312, -3.0000, 1.2578, -0.9062, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.8750, -2.6406, 1.2344, 0.0845, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.5156, -1.7266, 1.1250, 0.4082, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7812, -2.8125, 0.0308, 2.4375, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:07:45,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:07:45,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.76 | bwd_microstep: 465.70 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 464.48 | step_microstep: 1.89
[2025-11-06 18:07:45,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.57 | bwd: 466.66 | bwd_inner: 2.01 | bwd_allreduce: 464.52 | step: 1.98
26%|██▋ | 927/3507 [22:59<59:56, 1.39s/it] {'loss': 0.4915, 'learning_rate': 1.7259516803826054e-05, 'epoch': 0.26}
tensor([[-5.3750, -4.8125, -2.0156, 1.7734, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:07:45,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.66 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.0625, -2.3125, 0.8203, 0.4941, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7656, -3.0156, -0.7227, 1.2812, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0938, -3.6094, 0.0874, 1.4531, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4375, -1.2188, 2.2500, 0.4648, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.3594, -1.8203, 1.1953, 1.6719, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9062, -3.1719, 0.6719, 1.2734, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.2812, -3.5312, -0.8359, 2.1406, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:07:45,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:07:45,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.64 | bwd_microstep: 250.64 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 249.45 | step_microstep: 1.68
[2025-11-06 18:07:45,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.32 | bwd: 251.47 | bwd_inner: 1.87 | bwd_allreduce: 249.48 | step: 1.75
26%|██▋ | 928/3507 [22:59<49:31, 1.15s/it] {'loss': 0.3562, 'learning_rate': 1.7253160820587718e-05, 'epoch': 0.26}
tensor([[-2.2812, -1.2656, 1.0078, 2.7500, -1.1797]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4062, -3.3281, 0.9219, 0.5391, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:07:46,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.56 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.9688, -3.8906, -0.7969, 1.1719, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.9844, -0.5938, 2.2969, -0.5742, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2188, -2.6719, -0.5625, 2.2344, -1.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.1250, -4.4688, -1.6016, 1.5703, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9375, -3.1406, 0.6445, 0.7617, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.8516, -1.6562, -0.5781, 2.1094, -0.6797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:07:47,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:07:47,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.05 | bwd_microstep: 1246.81 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1245.62 | step_microstep: 1.80
[2025-11-06 18:07:47,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 515.64 | bwd: 1247.62 | bwd_inner: 1.84 | bwd_allreduce: 1245.66 | step: 1.87
26%|██▋ | 929/3507 [23:01<57:56, 1.35s/it] {'loss': 0.812, 'learning_rate': 1.7246798648445216e-05, 'epoch': 0.26}
tensor([[-3.8125, -2.6562, -0.1270, 0.5586, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:07:47,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.80 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.4062, -2.0625, -0.0850, 3.6094, -0.9492]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0625, -1.7109, 2.0781, 0.0106, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.1094, -0.0889, 2.5469, 0.5586, -1.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.3125, -3.9688, -1.7266, 1.8672, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8750, -2.4375, 0.4336, 1.1328, -2.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.9531, -1.5859, 1.9062, -0.7500, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4688e+00, -3.1875e+00, 1.7166e-03, 1.7031e+00, -2.9219e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:07:48,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:07:48,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.77 | bwd_microstep: 633.11 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 631.80 | step_microstep: 1.79
[2025-11-06 18:07:48,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.59 | bwd: 634.16 | bwd_inner: 2.20 | bwd_allreduce: 631.83 | step: 1.86
27%|██▋ | 930/3507 [23:02<53:01, 1.23s/it] {'loss': 0.2904, 'learning_rate': 1.7240430292827205e-05, 'epoch': 0.27}
tensor([[-3.3438, -0.8867, 1.8828, -1.6562, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6250, -1.0703, 2.4375, -0.4590, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.3438, -0.9258, 2.1094, -0.6367, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.7188, -2.2500, 1.6484, -0.6133, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:07:48,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.53 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.2500, -1.8438, 1.0312, 1.7031, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.7891, 0.5430, 3.2500, 0.6758, -1.4922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5938, -2.5156, 1.3359, 0.3828, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5938, -2.3281, 1.6094, 0.1943, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:07:49,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:07:49,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.72 | bwd_microstep: 2.19 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.95 | step_microstep: 2.10
[2025-11-06 18:07:49,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.27 | bwd: 3.11 | bwd_inner: 1.97 | bwd_allreduce: 0.99 | step: 2.18
27%|██▋ | 931/3507 [23:02<42:50, 1.00it/s] {'loss': 0.3208, 'learning_rate': 1.7234055759167602e-05, 'epoch': 0.27}
tensor([[-3.7500, -2.0000, 0.9336, 0.7227, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:07:49,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.76 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.0312, -1.7266, 1.2422, -1.5312, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5000, -3.6562, -0.0544, -0.5508, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.1250, -3.0625, -0.2969, 1.3594, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0312, -1.3281, 1.4531, 0.5664, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.1094, -1.0000, 2.2969, 0.6562, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.1875, -3.5469, -0.2793, 0.1279, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5938, -2.6250, 0.1689, 2.6875, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:07:51,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.28
[2025-11-06 18:07:51,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.80 | bwd_microstep: 1479.43 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1478.21 | step_microstep: 2.14
[2025-11-06 18:07:51,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.59 | bwd: 1480.32 | bwd_inner: 1.91 | bwd_allreduce: 1478.26 | step: 2.22
27%|██▋ | 932/3507 [23:04<54:08, 1.26s/it] {'loss': 0.3676, 'learning_rate': 1.7227675052905613e-05, 'epoch': 0.27}
tensor([[-4.1250, -2.0625, 1.6875, 0.9141, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:07:51,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.44 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.1875, -2.0156, 1.8828, 0.8281, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.3438, -1.0000, 1.9922, -0.8516, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.7812, -2.9062, -0.0820, 2.7031, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.5156, -1.6172, 0.3145, 1.3359, -1.5703]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.2656, 0.0850, 2.0469, 1.3984, -0.8398]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5312, -3.6094, -1.1172, 0.8828, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0000, -2.3750, 1.1953, 1.7969, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:07:52,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 18:07:52,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.07 | bwd_microstep: 1303.40 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1302.20 | step_microstep: 2.31
[2025-11-06 18:07:52,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.53 | bwd: 1304.35 | bwd_inner: 1.96 | bwd_allreduce: 1302.24 | step: 2.40
27%|██▋ | 933/3507 [23:06<59:36, 1.39s/it] {'loss': 0.8372, 'learning_rate': 1.72212881794857e-05, 'epoch': 0.27}
tensor([[-7.9062, -6.1250, -1.9062, -1.6719, -6.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6094, -3.4062, -1.5469, 2.1094, -1.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2188, -0.5742, 2.8438, -0.9531, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:07:53,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 299.49 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.8281, -0.2109, 3.1094, -0.0757, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.4844, -2.7344, -0.3340, 2.1719, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.1562, -3.0625, -0.3379, 1.0781, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7188, -1.1484, 2.4688, -0.5195, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0938, -2.4375, 1.9297, -0.6172, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:07:54,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.16 | optimizer_step: 0.20
[2025-11-06 18:07:54,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.58 | bwd_microstep: 895.24 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 894.06 | step_microstep: 1.88
[2025-11-06 18:07:54,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 466.10 | bwd: 896.25 | bwd_inner: 2.02 | bwd_allreduce: 894.11 | step: 1.96
27%|██▋ | 934/3507 [23:07<59:46, 1.39s/it] {'loss': 0.1929, 'learning_rate': 1.7214895144357592e-05, 'epoch': 0.27}
tensor([[-4.0938, -2.2344, 1.2344, 1.2422, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8594, -2.8438, -0.1396, 1.6094, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5312, -3.8125, -1.0312, 1.9844, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:07:54,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.26 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.9531, -3.8594, -2.2500, 1.1328, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.5312, -0.4863, 1.7031, 3.2188, -0.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7188, -1.8203, 1.6562, 1.3203, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.8750, -4.2500, -0.2871, 0.8828, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0312, -2.6094, 0.7188, 1.8906, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:07:55,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.75 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:07:55,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 303.69 | bwd_microstep: 978.00 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 976.93 | step_microstep: 2.51
[2025-11-06 18:07:55,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 500.98 | bwd: 978.84 | bwd_inner: 1.73 | bwd_allreduce: 976.97 | step: 2.59
27%|██▋ | 935/3507 [23:09<1:01:23, 1.43s/it] {'loss': 0.4499, 'learning_rate': 1.7208495952976273e-05, 'epoch': 0.27}
tensor([[-4.2188, -2.0156, 1.4766, -0.2773, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:07:55,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 104.89 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.3438, -4.3438, -1.3281, 0.2266, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-4.5312, -2.5000, 1.1094, 0.2500, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1875, -2.6406, 1.3828, -1.3906, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6250, -2.6562, 0.2041, 2.6875, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.8125, -2.8438, -0.2812, 1.4062, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2812, -3.3438, 0.7930, 0.7227, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.0312, -0.3418, 2.6875, 2.9062, -1.1484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:07:56,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:07:56,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.12 | bwd_microstep: 586.83 | bwd_inner_microstep: 1.51 | bwd_allreduce_microstep: 585.23 | step_microstep: 2.62
[2025-11-06 18:07:56,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 262.05 | bwd: 587.72 | bwd_inner: 2.28 | bwd_allreduce: 585.28 | step: 2.71
27%|██▋ | 936/3507 [23:10<54:18, 1.27s/it] {'loss': 0.9155, 'learning_rate': 1.7202090610801975e-05, 'epoch': 0.27}
tensor([[-2.0625, 0.1240, 2.0781, -0.9609, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-2.4688, -2.2344, -0.9609, 1.7031, -1.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:07:56,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.99 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-3.7969, -3.1094, -0.5781, 2.4688, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.7969, -1.4766, 1.0078, 1.4531, -1.8359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.8750, -2.6094, 1.5156, 0.2285, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2656, -0.7148, 2.7031, -0.2695, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.3125, 0.1104, 2.6875, -0.9453, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[1.6641, 3.5312, 5.0938, 3.1719, 1.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:07:56,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:07:56,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.05 | bwd_microstep: 98.71 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 97.85 | step_microstep: 2.06
[2025-11-06 18:07:56,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 280.05 | bwd: 99.76 | bwd_inner: 1.71 | bwd_allreduce: 97.90 | step: 2.17
27%|██▋ | 937/3507 [23:10<43:21, 1.01it/s] {'loss': 0.7885, 'learning_rate': 1.7195679123300192e-05, 'epoch': 0.27}
tensor([[-4.5625, -3.0156, 0.4023, 0.9922, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:07:57,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.32 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.5000, -3.5156, -0.4629, 1.7500, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5469, -1.1875, 1.5625, -1.5078, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-4.5000, -3.2188, -0.1826, 0.8984, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1875, -1.5469, 2.2500, -0.8086, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5625, -0.9062, 2.7656, -0.2031, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.1562, -0.4160, 2.7656, -1.4688, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5625, -2.7969, 0.7070, 0.4512, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:07:58,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.25 | optimizer_step: 0.24
[2025-11-06 18:07:58,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.09 | bwd_microstep: 1594.56 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 1593.70 | step_microstep: 2.53
[2025-11-06 18:07:58,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.43 | bwd: 1595.26 | bwd_inner: 1.34 | bwd_allreduce: 1593.76 | step: 2.61
27%|██▋ | 938/3507 [23:12<55:19, 1.29s/it] {'loss': 0.6477, 'learning_rate': 1.7189261495941648e-05, 'epoch': 0.27}
tensor([[-4.6250, -3.7031, -0.9727, 0.9609, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6562, -3.0469, 0.3105, 0.6992, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:07:59,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.48 | bwd_microstep: 1.13 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-4.6250, -2.7969, 1.0938, 1.2422, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8594, -0.2793, 2.9375, -0.4531, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.4219, 0.2832, 3.0156, 2.9688, -0.7383]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.2500, -1.7344, 1.7578, -1.1797, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2188, -1.8359, 1.9609, 0.1865, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6406, -3.2969, -1.2109, 2.2344, -1.9922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:08:00,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.42 | optimizer_step: 0.37
[2025-11-06 18:08:00,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.86 | bwd_microstep: 1204.95 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1203.80 | step_microstep: 3.54
[2025-11-06 18:08:00,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.34 | bwd: 1206.11 | bwd_inner: 2.03 | bwd_allreduce: 1203.88 | step: 3.65
27%|██▋ | 939/3507 [23:14<58:53, 1.38s/it] {'loss': 0.5288, 'learning_rate': 1.7182837734202316e-05, 'epoch': 0.27}
tensor([[-4.4375, -2.2812, 1.2500, -0.6133, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0625, -1.8750, 1.2656, -0.4980, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.5000, -4.2812, 0.2441, -0.4160, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:08:00,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 231.95 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-6.1562, -4.0625, 0.3906, 0.1729, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2500, -0.4258, 3.2500, -0.9336, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.8438, -3.3281, 0.1592, 0.8984, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6719, -2.1562, 1.0625, 1.8672, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.4062, -3.5625, 0.7383, -2.8125, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-
18:08:02,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.22 | optimizer_step: 0.32 [2025-11-06 18:08:02,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.49 | bwd_microstep: 1656.64 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1655.47 | step_microstep: 2.52 [2025-11-06 18:08:02,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 464.45 | bwd: 1657.61 | bwd_inner: 1.95 | bwd_allreduce: 1655.51 | step: 2.59 27%|██▋ | 940/3507 [23:16<1:09:01, 1.61s/it] {'loss': 0.2888, 'learning_rate': 1.7176407843563398e-05, 'epoch': 0.27} 27%|██▋ | 940/3507 [23:16<1:09:01, 1.61s/it]tensor([[-4.0625, -3.3125, -1.0625, 1.1719, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4531, -0.9883, 2.0469, -1.1719, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1875, -4.1250, -0.9883, 1.1719, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7812, -2.9062, 0.9023, 0.6133, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:02,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 314.52 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.3125, -3.8281, -1.7109, 0.9883, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1562, -1.1562, 2.3281, -2.7812, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6875, -2.7969, 1.1328, 0.9844, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5938, -2.9219, -0.4727, 2.6406, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:08:04,303] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:08:04,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.16 | bwd_microstep: 1108.30 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1107.12 | step_microstep: 1.87 [2025-11-06 18:08:04,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 549.70 | bwd: 1109.35 | bwd_inner: 2.05 | bwd_allreduce: 1107.17 | step: 1.96 27%|██▋ | 941/3507 [23:18<1:10:10, 1.64s/it] {'loss': 0.2581, 'learning_rate': 1.7169971829511326e-05, 'epoch': 0.27} 27%|██▋ | 941/3507 [23:18<1:10:10, 1.64s/it]tensor([[-3.2344, -2.7500, -0.5586, 2.9375, -1.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:04,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.47 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.6719, -1.3125, 1.4688, 2.5781, -1.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7812, -2.4375, 1.5625, -0.2734, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.9062, -1.3594, 1.4219, 1.4062, -1.9609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.7500, -1.4688, 1.3047, 2.6094, -1.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0938, -3.8125, -1.7734, 1.6406, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2812, -2.7656, 1.4062, -0.6406, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3125, -2.9219, 0.1611, 1.0703, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:08:05,130] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:08:05,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.72 | bwd_microstep: 483.24 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 482.06 | step_microstep: 1.54 [2025-11-06 18:08:05,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.22 | bwd: 484.23 | bwd_inner: 1.99 | bwd_allreduce: 482.10 | step: 1.63 27%|██▋ | 942/3507 [23:18<59:42, 1.40s/it] {'loss': 0.5386, 'learning_rate': 1.7163529697537756e-05, 'epoch': 0.27} 27%|██▋ | 942/3507 [23:18<59:42, 1.40s/it]tensor([[-4.7812e+00, -2.6250e+00, 1.2969e+00, 4.3945e-03, -3.6719e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2812, -3.6875, -1.4688, 1.2656, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:05,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.69 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.8438, -2.0781, 1.2188, 1.1016, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5469, -2.2656, 0.7109, 2.0312, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8594, -1.2656, 1.7812, -1.9531, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3281, -2.3281, -0.2852, 0.9492, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7344, -2.6250, 0.2139, 2.0000, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4375, -3.3281, 0.9570, 0.7930, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:08:05,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.12 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:08:05,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.93 | bwd_microstep: 54.31 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 53.21 | step_microstep: 1.36 [2025-11-06 18:08:05,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.65 | bwd: 55.18 | bwd_inner: 1.81 | bwd_allreduce: 53.24 | step: 1.44 27%|██▋ | 943/3507 [23:19<47:04, 1.10s/it] {'loss': 0.3321, 'learning_rate': 1.7157081453139564e-05, 'epoch': 0.27} 27%|██▋ | 943/3507 [23:19<47:04, 1.10s/it]tensor([[-4.2500, -2.9375, 0.1021, 1.3438, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:05,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.70 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.2500, -3.3125, 0.2490, -0.5977, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -2.5781, 0.6055, 0.2715, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9688, -1.3125, 2.5469, -0.6211, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2500, -2.5625, 0.8164, 1.2188, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7188, -0.7227, 2.9062, -1.5234, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6562, -2.9375, 0.8438, 1.5234, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8594, -2.2500, 0.9258, 1.2812, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:08:07,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 
0.16 | optimizer_step: 0.17 [2025-11-06 18:08:07,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.38 | bwd_microstep: 1216.68 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1215.60 | step_microstep: 1.83 [2025-11-06 18:08:07,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.10 | bwd: 1217.56 | bwd_inner: 1.81 | bwd_allreduce: 1215.64 | step: 1.90 27%|██▋ | 944/3507 [23:20<52:56, 1.24s/it] {'loss': 0.4012, 'learning_rate': 1.7150627101818848e-05, 'epoch': 0.27} 27%|██▋ | 944/3507 [23:20<52:56, 1.24s/it]tensor([[-3.7344, -3.1094, -0.8438, 2.0469, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.9375, -1.6953, -0.3750, 2.6562, -0.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-3.1875, -2.8281, -0.9375, 2.2812, -1.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:07,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.40 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10 tensor([[-3.1719, -2.0938, 0.3574, 1.5938, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7188, -2.1406, 1.0234, 1.4375, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2031, -1.7266, 1.1172, 1.2812, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.8281, -2.4531, -0.5312, 2.6250, -1.4297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1875, -2.8594, 0.2793, 1.6406, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:08:08,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 
18:08:08,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.44 | bwd_microstep: 1104.34 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1103.12 | step_microstep: 2.07 [2025-11-06 18:08:08,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.87 | bwd: 1105.17 | bwd_inner: 1.87 | bwd_allreduce: 1103.17 | step: 2.17 27%|██▋ | 945/3507 [23:22<56:02, 1.31s/it] {'loss': 0.7474, 'learning_rate': 1.7144166649082907e-05, 'epoch': 0.27} 27%|██▋ | 945/3507 [23:22<56:02, 1.31s/it]tensor([[-4.1875, -2.4688, 0.8945, 0.5625, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:08,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.98 | bwd_microstep: 2.30 | bwd_inner_microstep: 2.16 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-3.9062, -2.9844, -0.2656, 1.9609, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.0938, -0.5508, 1.9766, 2.1875, -1.2578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6562, -4.7188, -1.5938, 0.9531, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.0000, -4.6250, -1.1562, -0.1611, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.5625, 0.0991, 2.7031, 2.4688, -0.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6875, -2.9844, -0.5312, 2.2188, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2344, -2.1094, -0.4785, 3.5156, -0.7695]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:08:10,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:08:10,627] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.06 | bwd_microstep: 1668.45 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 1667.56 | step_microstep: 2.26 [2025-11-06 18:08:10,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.07 | bwd: 1670.76 | bwd_inner: 2.98 | bwd_allreduce: 1667.62 | step: 2.36 27%|██▋ | 946/3507 [23:24<1:05:21, 1.53s/it] {'loss': 0.4826, 'learning_rate': 1.7137700100444257e-05, 'epoch': 0.27} 27%|██▋ | 946/3507 [23:24<1:05:21, 1.53s/it]tensor([[-4.1250, -1.5703, 2.3281, -0.6094, -3.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5000, -2.3750, 1.0156, -0.9258, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -2.0938, 1.8594, 0.8945, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:10,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 237.34 | bwd_microstep: 3.00 | bwd_inner_microstep: 2.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-2.5000, -1.6641, 0.3965, 2.0781, -1.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6719, -1.4453, 1.0547, 1.8438, -1.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -3.5781, 0.1338, -0.3340, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.7031, -1.2891, 2.4375, 0.3965, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8281, -2.8281, -0.1230, 1.6875, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:08:12,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.27 | optimizer_step: 0.41 [2025-11-06 18:08:12,550] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 164.38 | bwd_microstep: 1463.22 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1462.11 | step_microstep: 2.82 [2025-11-06 18:08:12,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 401.72 | bwd: 1466.21 | bwd_inner: 3.88 | bwd_allreduce: 1462.17 | step: 2.92 27%|██▋ | 947/3507 [23:26<1:10:20, 1.65s/it] {'loss': 0.3807, 'learning_rate': 1.7131227461420605e-05, 'epoch': 0.27} 27%|██▋ | 947/3507 [23:26<1:10:20, 1.65s/it]tensor([[-4.3750, -3.6250, -1.0938, 1.2812, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7656, -2.7812, -0.2559, 1.6328, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:12,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.15 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.6094, -3.5625, -2.2031, 1.0312, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7969, -1.0234, 1.6406, 0.4746, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9531, -2.8438, -0.1226, 1.6094, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.5000, -0.8164, 1.7031, 0.7188, -1.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.0781, -1.4219, 1.6641, 1.7344, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4844, -2.7031, -0.3027, 2.0469, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:08:13,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.16 | optimizer_step: 0.21 [2025-11-06 18:08:13,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.20 | 
bwd_microstep: 516.18 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 515.03 | step_microstep: 1.81 [2025-11-06 18:08:13,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.37 | bwd: 517.05 | bwd_inner: 1.86 | bwd_allreduce: 515.07 | step: 1.90 27%|██▋ | 948/3507 [23:27<1:01:01, 1.43s/it] {'loss': 0.4928, 'learning_rate': 1.712474873753486e-05, 'epoch': 0.27} 27%|██▋ | 948/3507 [23:27<1:01:01, 1.43s/it]tensor([[-4.0312, -3.7031, -1.5859, 2.2969, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4688, -4.2500, -0.6367, 1.6328, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -2.8750, 1.0391, 0.3340, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:08:13,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 328.58 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.9375, -2.4531, 1.7812, -0.1079, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0000, -2.5469, -0.3906, 3.0000, -1.4766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.0312, -5.1875, -2.2031, 0.5508, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -2.2969, 0.9844, 0.4863, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -3.3906, 0.2656, 1.0391, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:08:15,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 18:08:15,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.02 | bwd_microstep: 1289.35 | 
bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1288.23 | step_microstep: 1.89 [2025-11-06 18:08:15,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 506.62 | bwd: 1290.30 | bwd_inner: 1.89 | bwd_allreduce: 1288.27 | step: 1.96 27%|██▋ | 949/3507 [23:29<1:06:14, 1.55s/it] {'loss': 0.2193, 'learning_rate': 1.7118263934315122e-05, 'epoch': 0.27} 27%|██▋ | 949/3507 [23:29<1:06:14, 1.55s/it]tensor([[-3.6562, -3.1250, -0.7344, 2.6875, -1.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0000, -2.5469, 0.8711, 1.9453, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:08:15,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.34 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.5312, -2.4375, 0.1270, 1.5312, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4219, -0.6172, 3.0625, -0.5781, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -2.1406, 1.3984, 0.6953, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.6406, -0.6836, 1.9844, 0.5195, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -3.1406, -0.1865, 1.6016, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5938, -3.2344, 1.0469, -0.4043, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:08:15,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:08:15,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.08 | bwd_microstep: 102.36 | bwd_inner_microstep: 1.02 | 
bwd_allreduce_microstep: 101.27 | step_microstep: 1.41 [2025-11-06 18:08:15,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.45 | bwd: 103.28 | bwd_inner: 1.86 | bwd_allreduce: 101.30 | step: 1.48 27%|██▋ | 950/3507 [23:29<52:04, 1.22s/it] {'loss': 0.4449, 'learning_rate': 1.711177305729468e-05, 'epoch': 0.27} 27%|██▋ | 950/3507 [23:29<52:04, 1.22s/it]tensor([[-3.0469, -2.3750, -0.2832, 2.1562, -1.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.1250, 0.0420, 2.2969, -0.9492, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:08:15,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.52 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.6875, -2.7031, 1.2500, 0.7930, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2969, -0.7695, 2.8750, -0.0303, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.1250, -1.7188, 1.4062, 2.6875, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2500, -0.6953, 2.7656, -0.6250, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7500, -2.7656, 0.5977, -0.0250, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.3750, 0.2119, 2.5000, -1.8438, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 18:08:16,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:08:16,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.31 | bwd_microstep: 846.20 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 845.00 | 
step_microstep: 2.23 [2025-11-06 18:08:17,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.87 | bwd: 847.19 | bwd_inner: 2.01 | bwd_allreduce: 845.04 | step: 2.30 27%|██▋ | 951/3507 [23:30<52:14, 1.23s/it] {'loss': 0.5865, 'learning_rate': 1.7105276112012008e-05, 'epoch': 0.27} 27%|██▋ | 951/3507 [23:30<52:14, 1.23s/it]tensor([[-3.4531, -1.0078, 2.2344, -0.6406, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.8281, -0.1533, 2.2969, 1.3828, -1.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9531, -2.8438, 0.0884, 2.1875, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:17,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.05 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.4062, -0.7617, 2.7344, -0.8359, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.0000, -1.5156, 2.3125, -0.2070, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9219, -1.7109, 1.4688, -0.5430, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.1875, -2.1406, 0.4121, 1.8984, -1.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2188, -2.6875, 1.7891, -0.1650, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:08:19,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:08:19,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.32 | bwd_microstep: 2043.15 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 2042.03 | step_microstep: 2.14 [2025-11-06 
18:08:19,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.39 | bwd: 2044.03 | bwd_inner: 1.82 | bwd_allreduce: 2042.08 | step: 2.22 27%|██▋ | 952/3507 [23:33<1:07:59, 1.60s/it] {'loss': 0.6102, 'learning_rate': 1.709877310401075e-05, 'epoch': 0.27} 27%|██▋ | 952/3507 [23:33<1:07:59, 1.60s/it]tensor([[-4.1875, -2.4688, 1.0703, 1.5234, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6406, -3.2031, -1.0703, 2.2188, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:19,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.70 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.7188, -4.0938, -1.5312, 1.6250, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -4.1250, -1.4688, 1.7891, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6250, -2.3750, 1.1719, -0.5312, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5938, -1.0234, 1.2422, 0.0859, -1.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2188, -4.8750, -1.2266, 0.4316, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -1.9375, 2.2031, -0.7695, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:08:19,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.16 | optimizer_step: 0.21 [2025-11-06 18:08:19,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.93 | bwd_microstep: 1.83 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 1.65 [2025-11-06 18:08:19,853] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.66 | bwd: 2.85 | bwd_inner: 1.93 | bwd_allreduce: 0.78 | step: 1.73 27%|██▋ | 953/3507 [23:33<52:35, 1.24s/it] {'loss': 0.247, 'learning_rate': 1.7092264038839724e-05, 'epoch': 0.27} 27%|██▋ | 953/3507 [23:33<52:35, 1.24s/it]tensor([[-4.2500, -4.0312, -1.9453, 2.2031, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:20,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.88 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.6250, -0.1128, 3.4062, 0.3066, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5312, -1.0391, 2.5312, -0.2051, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7500, -1.8750, 1.2500, 0.5820, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6250, -2.7500, 1.1406, 1.3672, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[1.5859, 3.4375, 5.0625, 3.2188, 1.3828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5000, -3.2969, 1.1641, 0.5391, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7188, -2.5469, 0.0986, 1.2734, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:08:21,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.18 | optimizer_step: 0.27 [2025-11-06 18:08:21,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.60 | bwd_microstep: 1524.67 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1523.56 | step_microstep: 2.01 [2025-11-06 18:08:21,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
341.50 | bwd: 1525.61 | bwd_inner: 1.89 | bwd_allreduce: 1523.60 | step: 2.08 27%|██▋ | 954/3507 [23:35<1:01:03, 1.43s/it] {'loss': 0.6717, 'learning_rate': 1.7085748922052923e-05, 'epoch': 0.27} 27%|██▋ | 954/3507 [23:35<1:01:03, 1.43s/it]tensor([[-3.2188, -0.7305, 1.9141, -1.4766, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:08:21,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 87.90 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.0000, -2.2188, 0.3730, 3.2656, -1.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9062, -4.3125, -1.5859, 1.6719, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9219, -2.9844, -0.3652, 1.6328, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8438, -1.1016, 2.9844, -0.1494, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.0156, -0.3984, 2.5781, -1.2500, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6250, -2.9062, -0.5234, 2.0781, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4375, -2.4062, 0.1826, 2.0938, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:08:22,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:08:22,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.50 | bwd_microstep: 800.92 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 799.85 | step_microstep: 1.47 [2025-11-06 18:08:22,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.41 | bwd: 801.77 | bwd_inner: 1.74 
| bwd_allreduce: 799.89 | step: 1.55
27%|██▋ | 955/3507 [23:36<57:50, 1.36s/it] {'loss': 0.0977, 'learning_rate': 1.7079227759209503e-05, 'epoch': 0.27}
tensor([[-3.7188, -1.9219, 1.1953, 0.9141, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:08:23,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.54 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.3594, -1.1797, 2.3281, 0.7422, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2812, -2.1562, 1.5312, 0.3281, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.1719, -0.6133, 2.6562, -0.3789, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2500, -3.2344, -0.2969, 2.0469, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5000, -2.1094, 1.2656, -1.2266, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0312, -2.8281, 0.0640, 1.2344, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4062, -2.1719, 1.9062, 0.9062, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:08:24,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:08:24,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.85 | bwd_microstep: 685.86 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 684.80 | step_microstep: 1.87
[2025-11-06 18:08:24,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.42 | bwd: 686.82 | bwd_inner: 1.83 | bwd_allreduce: 684.84 | step: 1.95
27%|██▋ | 956/3507 [23:37<54:14, 1.28s/it] {'loss': 0.2927, 'learning_rate': 1.7072700555873774e-05, 'epoch': 0.27}
tensor([[-1.6172, 0.4180, 2.0312, -0.7148, -1.5234]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.2344, -0.9414, 2.4062, 0.2119, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:08:24,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.69 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.0000, -2.7656, 1.3281, 0.2676, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2500, -1.3984, 2.7500, -0.8828, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8438, -2.5000, 1.6172, 0.1455, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6406, -1.6875, 1.7969, 1.1484, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7656, -2.0938, 1.3047, 1.7266, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7812, -1.2266, 2.1719, -0.9453, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:08:24,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.22 | optimizer_step: 0.18
[2025-11-06 18:08:24,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 73.41 | bwd_microstep: 265.76 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 264.69 | step_microstep: 1.96
[2025-11-06 18:08:24,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 203.11 | bwd: 266.46 | bwd_inner: 1.61 | bwd_allreduce: 264.72 | step: 2.03
27%|██▋ | 957/3507 [23:38<44:17, 1.04s/it] {'loss': 0.6452, 'learning_rate': 1.7066167317615203e-05, 'epoch': 0.27}
tensor([[-3.4375, -0.6992, 2.4688, -1.1953, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:08:24,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.07 | bwd_microstep: 2.87 | bwd_inner_microstep: 2.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11
tensor([[-4.1250, -1.3906, 2.7969, -0.1250, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7812, -3.0938, -0.6250, 2.1250, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.6406, -0.1484, 2.1562, -1.4609, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.2500, -0.4062, 3.0938, -0.9414, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7344, -2.3438, 0.6133, 1.4141, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.7344, -3.0312, -0.6680, 1.7969, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0625, -2.5156, 0.5625, 0.6758, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:08:25,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:08:25,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.99 | bwd_microstep: 889.99 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 888.96 | step_microstep: 1.64
[2025-11-06 18:08:25,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.13 | bwd: 892.85 | bwd_inner: 3.68 | bwd_allreduce: 889.01 | step: 1.76
27%|██▋ | 958/3507 [23:39<47:21, 1.11s/it] {'loss': 0.4879, 'learning_rate': 1.7059628050008403e-05, 'epoch': 0.27}
tensor([[-6.7188, -6.3438, -4.0000, -0.7344, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-2.9062, -1.4766, 0.9297, 1.5000, -1.8672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:08:25,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.49 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-5.1250, -3.9531, -0.5469, 1.6953, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3125, -2.1875, 1.5312, 0.3750, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.3125, -1.2188, 1.8438, 0.0130, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.0938, -0.8672, 2.2969, 0.1367, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5625, -3.0469, 0.6133, -1.8750, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8750, -2.9062, 0.5547, -0.2363, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:08:26,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.25 | optimizer_step: 0.16
[2025-11-06 18:08:26,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 40.89 | bwd_microstep: 218.07 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 217.20 | step_microstep: 1.75
[2025-11-06 18:08:26,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 169.40 | bwd: 218.95 | bwd_inner: 1.59 | bwd_allreduce: 217.23 | step: 1.84
27%|██▋ | 959/3507 [23:40<38:23, 1.11it/s] {'loss': 1.0826, 'learning_rate': 1.7053082758633138e-05, 'epoch': 0.27}
tensor([[-4.3750, -1.5156, 1.7969, -2.5156, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8906, -2.7812, 0.1943, 2.1406, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.2812, -0.3477, 1.8672, -0.2578, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.7969e+00, -1.6484e+00, 1.5469e+00, -2.0294e-03, -2.9531e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:08:27,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.04 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10
tensor([[-2.5312, -0.0752, 2.6562, -0.8789, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.0000, -2.7812, 0.1050, 1.3203, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.3594, -0.4980, 1.9844, 0.4980, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9375, -2.9219, 1.2188, 0.9531, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:08:28,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:08:28,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.96 | bwd_microstep: 1687.49 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1686.45 | step_microstep: 2.06
[2025-11-06 18:08:28,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.01 | bwd: 1688.55 | bwd_inner: 1.91 | bwd_allreduce: 1686.49 | step: 2.16
27%|██▋ | 960/3507 [23:42<1:01:49, 1.46s/it] {'loss': 0.7062, 'learning_rate': 1.7046531449074305e-05, 'epoch': 0.27}
tensor([[-3.8906, -3.0938, -0.8125, 1.3359, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1250, -3.1250, -0.2139, 2.0781, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:08:29,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.06 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.1719, -0.3457, 3.2812, -0.7695, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3750, -2.8125, 0.4609, 1.0469, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3750, -2.8750, 0.7344, 1.7969, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0312, -4.1250, -1.3125, 0.9727, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.4062, -2.1562, -0.9414, 1.5625, -1.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-2.9062, -0.9844, 1.8516, 0.6328, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:08:29,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:08:29,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.36 | bwd_microstep: 1.69 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.64 | step_microstep: 1.40
[2025-11-06 18:08:29,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.44 | bwd: 2.61 | bwd_inner: 1.82 | bwd_allreduce: 0.67 | step: 1.48
27%|██▋ | 961/3507 [23:43<48:17, 1.14s/it] {'loss': 0.5539, 'learning_rate': 1.7039974126921946e-05, 'epoch': 0.27}
tensor([[-2.7188, -0.1592, 2.6094, -0.8359, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.1719, 0.0216, 2.3125, 3.6406, -0.3164]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:08:29,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.63 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-2.4062, 0.1162, 2.2500, -1.8750, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-3.2188, -1.8828, 0.3047, 0.1172, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.1875, -3.0781, -0.1777, 1.3125, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.1875, -3.9062, -0.5156, 1.3125, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5469, -2.7500, -0.2295, 2.2969, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9375, -4.0312, -0.8750, 2.0625, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:08:31,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.19 | optimizer_step: 0.21
[2025-11-06 18:08:31,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.56 | bwd_microstep: 1613.28 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 1611.95 | step_microstep: 2.04
[2025-11-06 18:08:31,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.23 | bwd: 1614.25 | bwd_inner: 2.08 | bwd_allreduce: 1612.01 | step: 2.14
27%|██▋ | 962/3507 [23:45<59:49, 1.41s/it] {'loss': 0.7427, 'learning_rate': 1.703341079777122e-05, 'epoch': 0.27}
tensor([[-4.1875, -1.8594, 2.0938, 0.4023, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:08:31,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.90 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.5938, -1.7266, 1.2500, 0.0864, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2031, -1.2031, 1.7812, 0.5312, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4688, -4.5938, -1.6797, 0.7695, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0000, -1.8203, 1.5859, 0.2852, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5000, -3.8281, -1.6172, 0.8672, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.5625, -2.0000, -0.3574, 1.8125, -1.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1562, -1.8047, 2.2031, 0.5938, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:08:31,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:08:31,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.32 | bwd_microstep: 2.06 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 1.56
[2025-11-06 18:08:31,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 414.24 | bwd: 2.99 | bwd_inner: 2.06 | bwd_allreduce: 0.80 | step: 1.65
27%|██▋ | 963/3507 [23:45<47:38, 1.12s/it] {'loss': 0.5197, 'learning_rate': 1.7026841467222425e-05, 'epoch': 0.27}
tensor([[-4.6250, -2.8281, 1.0781, 1.6641, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6562, -4.0312, -0.3613, 0.4727, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:08:32,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.67 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11
tensor([[-3.6875, -3.1406, -0.8086, 2.4219, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4375, -1.4297, 1.4453, -0.1846, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.6875, -4.0312, -1.3516, 1.7109, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.3125, 0.1182, 2.3125, -1.2656, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.5938, -3.1406, -0.9961, 2.1562, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5938, -1.6875, 1.4141, 0.8633, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:08:34,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.17 | optimizer_step: 0.23
[2025-11-06 18:08:34,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.10 | bwd_microstep: 1901.41 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1900.20 | step_microstep: 2.00
[2025-11-06 18:08:34,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.81 | bwd: 1902.42 | bwd_inner: 2.03 | bwd_allreduce: 1900.25 | step: 2.11
27%|██▋ | 964/3507 [23:47<1:01:58, 1.46s/it] {'loss': 0.512, 'learning_rate': 1.7020266140880967e-05, 'epoch': 0.27}
tensor([[-3.0312, -2.7812, -1.2188, 1.7656, -1.6172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0312, -1.5234, 1.9531, -0.5977, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.6094, -2.4219, -1.1094, 2.2031, -1.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8438, -3.3750, -0.2969, 0.4668, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:08:34,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.18 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.9219, -3.2500, -0.8008, 1.9844, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1562, -2.8125, 0.1128, 1.0156, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.5625, -0.2490, 2.9531, 0.9336, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0312, -1.2422, 2.5156, -0.9258, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:08:34,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:08:34,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.50 | bwd_microstep: 2.04 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.93 | step_microstep: 1.67
[2025-11-06 18:08:34,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 404.71 | bwd: 3.03 | bwd_inner: 1.92 | bwd_allreduce: 0.96 | step: 1.75
28%|██▊ | 965/3507 [23:48<49:02, 1.16s/it] {'loss': 0.7734, 'learning_rate': 1.7013684824357376e-05, 'epoch': 0.28}
tensor([[-4.8438, -2.3281, 0.5000, -2.6094, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:08:34,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.84 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-3.7656, -3.2500, -0.7305, 2.9688, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2344, -0.8828, 1.8203, -1.1641, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.1562, -3.1875, -0.3340, 2.1562, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.4219, -3.1094, -1.0078, 2.7656, -1.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.6250, -0.2988, 2.5156, -0.1631, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7031, -0.6680, 2.4531, 1.0938, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.9062, -2.5469, 0.3828, 1.2734, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:08:36,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:08:36,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.04 | bwd_microstep: 1224.11 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1223.05 | step_microstep: 1.72
[2025-11-06 18:08:36,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.90 | bwd: 1225.13 | bwd_inner: 1.87 | bwd_allreduce: 1223.10 | step: 1.82
28%|██▊ | 966/3507 [23:49<54:47, 1.29s/it] {'loss': 1.1205, 'learning_rate': 1.7007097523267292e-05, 'epoch': 0.28}
tensor([[-2.3594, 0.0349, 2.4219, -1.0781, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2188, -0.5469, 3.0312, -0.4668, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9062, -4.0312, -1.3594, 0.5820, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.0781, -2.6406, -0.6836, 2.2812, -1.6328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:08:36,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.36 | bwd_microstep: 1.18 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.5625, -0.7852, 2.5938, -0.8555, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7969, -2.5469, 0.3262, 1.3359, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5938, -4.9062, -2.0781, 1.2266, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4062, -2.7812, 0.6328, 1.1719, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:08:36,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:08:36,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.01 | bwd_microstep: 1.94 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.78 | step_microstep: 1.80
[2025-11-06 18:08:36,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 456.37 | bwd: 3.11 | bwd_inner: 2.19 | bwd_allreduce: 0.81 | step: 1.87
28%|██▊ | 967/3507 [23:50<44:39, 1.05s/it] {'loss': 0.2984, 'learning_rate': 1.7000504243231466e-05, 'epoch': 0.28}
tensor([[-4.4062, -3.8906, -1.3828, 1.9609, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:08:36,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.34 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.3438, -2.1719, 0.7695, -0.7266, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.7500, -2.5781, -0.8398, 2.8438, -1.2422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0469, -0.2109, 3.0625, -0.9453, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.7812, -2.0625, 1.3516, 1.6953, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1250, -3.2969, -0.6641, 1.6797, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0625, -3.8594, -0.1338, -1.4922, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2188, -2.3281, 0.9961, 0.8359, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:08:38,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:08:38,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.55 | bwd_microstep: 1391.69 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1390.61 | step_microstep: 1.92
[2025-11-06 18:08:38,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.91 | bwd: 1392.76 | bwd_inner: 1.98 | bwd_allreduce: 1390.65 | step: 2.00
28%|██▊ | 968/3507 [23:52<54:08, 1.28s/it] {'loss': 0.2696, 'learning_rate': 1.6993904989875737e-05, 'epoch': 0.28}
tensor([[-6.1875, -3.8750, 0.6250, 0.0145, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.4062, -3.2812, -0.1758, 2.2031, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
[2025-11-06 18:08:38,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.55 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([3], device='cuda:2')
tensor([[-2.5938, -0.1621, 1.9297, -1.4844, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.2188, -0.3965, 1.7734, 0.4277, -1.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.6719, -1.6406, 1.9922, 1.4297, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.0312, -2.8125, 0.4004, 2.4375, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5625, -3.5625, -0.6602, 1.4609, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.8125, -2.3125, 1.7812, 0.0286, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:08:38,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 18:08:38,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.15 | bwd_microstep: 141.69 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 140.64 | step_microstep: 2.03
[2025-11-06 18:08:38,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.72 | bwd: 142.59 | bwd_inner: 1.80 | bwd_allreduce: 140.68 | step: 2.11
28%|██▊ | 969/3507 [23:52<44:03, 1.04s/it] {'loss': 0.503, 'learning_rate': 1.6987299768831057e-05, 'epoch': 0.28}
tensor([[-5.3125, -3.3750, 0.4102, 0.4980, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0938, -2.7969, 1.2422, 0.0684, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.3438, -2.6562, -0.4180, 2.2031, -1.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:08:39,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.16 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-3.6094, -1.4219, 1.8516, 0.4121, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.4375, -2.6406, 0.7695, 0.6641, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-8.6875, -6.7500, -2.6406, -2.2344, -6.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7969, -1.4844, 2.0781, 0.3301, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.5000, 0.1865, 2.8750, -1.1406, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:08:41,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:08:41,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.71 | bwd_microstep: 1944.07 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 1942.81 | step_microstep: 1.84
[2025-11-06 18:08:41,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.90 | bwd: 1945.02 | bwd_inner: 2.02 | bwd_allreduce: 1942.86 | step: 1.93
28%|██▊ | 970/3507 [23:55<1:02:19, 1.47s/it] {'loss': 0.6177, 'learning_rate': 1.6980688585733456e-05, 'epoch': 0.28}
tensor([[-2.0469, -1.2188, 0.8867, 2.5625, -1.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0625, -1.3203, 2.1094, -1.7422, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3438, -3.7344, 0.1475, 1.3516, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2500, -0.4902, 2.5625, -1.2109, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:08:41,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.46 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.0000, -3.2969, -0.9102, 1.7031, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0312, -2.1562, 1.1719, 0.4180, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.3438, -1.0312, 1.5312, -1.4531, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8438, -3.5312, -1.5625, 1.9609, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:08:41,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:08:41,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.08 | bwd_microstep: 1.65 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.65 | step_microstep: 1.90
[2025-11-06 18:08:41,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.57 | bwd: 2.52 | bwd_inner: 1.73 | bwd_allreduce: 0.68 | step: 1.97
28%|██▊ | 971/3507 [23:55<48:47, 1.15s/it] {'loss': 0.1667, 'learning_rate': 1.6974071446224066e-05, 'epoch': 0.28}
tensor([[-2.8125, -2.5781, -1.0469, 2.1562, -1.3672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.1875, 0.1982, 2.6875, -0.1377, -1.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:08:42,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.54 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-3.9531, -2.8438, 0.0062, 2.0156, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6250, -4.3438, -2.2344, 1.3516, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0000, -3.7188, -1.5781, 2.0469, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2188, -1.5703, 1.9141, -1.0781, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4844, -3.2500, -1.6641, 1.2812, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.3281, -0.4375, 1.6719, -0.1328, -1.8984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:08:43,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.27 | optimizer_gradients: 0.19 | optimizer_step: 0.22
[2025-11-06 18:08:43,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.39 | bwd_microstep: 1263.04 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1261.96 | step_microstep: 3.66
[2025-11-06 18:08:43,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 266.95 | bwd: 1264.01 | bwd_inner: 1.85 | bwd_allreduce: 1262.02 | step: 3.77
28%|██▊ | 972/3507 [23:57<53:58, 1.28s/it] {'loss': 0.1179, 'learning_rate': 1.6967448355949087e-05, 'epoch': 0.28}
tensor([[-3.8750, -2.6875, 0.0277, 1.4609, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.3750, -3.9531, -1.4297, 2.4688, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:08:43,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.06 | bwd_microstep: 1.12 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11
tensor([[-4.1250, -3.4062, -0.8242, 2.0938, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0938, -2.8125, 1.3594, 0.3848, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7969, -1.4766, 1.9375, 0.0055, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.2188, -3.1250, 0.2002, -1.0234, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.9531, -2.6250, 0.3359, 1.3984, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5312, -3.1250, 1.3438, 0.0942, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:08:43,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.13 | optimizer_step: 0.14
[2025-11-06 18:08:43,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.93 | bwd_microstep: 157.29 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 156.10 | step_microstep: 1.38
[2025-11-06 18:08:43,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.00 | bwd: 158.41 | bwd_inner: 2.07 | bwd_allreduce: 156.17 | step: 1.49
28%|██▊ | 973/3507 [23:57<44:06, 1.04s/it] {'loss': 0.3397, 'learning_rate': 1.6960819320559806e-05, 'epoch': 0.28}
tensor([[-3.7656, -1.5781, 1.6094, 0.0417, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:08:44,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.38 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.18
tensor([[-5.8438, -4.0000, 0.0571, 0.6797, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.6875, -1.1094, 2.6562, 0.1089, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.5625, -2.0312, 0.0212, 2.8906, -1.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.3125, -1.1406, 2.4219, 1.2266, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6719, -3.5000, -1.5703, 2.2344, -1.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8594, -2.3594, 0.7227, 1.7500, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.1875, -3.2500, 0.6250, 0.6641, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:08:46,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.22 | optimizer_step: 0.21
[2025-11-06 18:08:46,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.51 | bwd_microstep: 1732.67 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1731.59 | step_microstep: 2.06
[2025-11-06 18:08:46,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.93 | bwd: 1733.71 | bwd_inner: 1.95 | bwd_allreduce: 1731.63 | step: 2.24
28%|██▊ | 974/3507 [23:59<57:44, 1.37s/it] {'loss': 0.3598, 'learning_rate': 1.6954184345712575e-05, 'epoch': 0.28}
tensor([[-5.7188, -4.2500, -0.7539, 0.5039, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0938, -2.3125, 1.0234, 1.3594, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4375, -2.9062, 0.4863, 1.5703, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:08:46,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.75 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.6094, -2.1094, 1.1641, 2.0781, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6562, -1.9062, 0.8984, 0.1768, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5625, -2.9531, 0.5000, 1.3672, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6406, -0.2334, 3.1562, 1.3047, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.4062, -0.4961, 1.9219, 0.5000, -1.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:2')
[2025-11-06 18:08:46,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 18:08:46,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.08 | bwd_microstep: 72.16 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 71.05 | step_microstep: 1.60
[2025-11-06 18:08:46,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 422.86 | bwd: 73.07 | bwd_inner: 1.85 | bwd_allreduce: 71.09 | step: 1.69
28%|██▊ | 975/3507 [24:00<47:10, 1.12s/it] {'loss': 1.3164, 'learning_rate': 1.6947543437068822e-05, 'epoch': 0.28}
tensor([[-3.6719, -0.9297, 2.7969, -0.6211, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4688, -2.1875, 1.5469, 0.0098, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.2812, -3.3438, -0.5234, 1.8203, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:08:46,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.53 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.74 |
bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-2.5625, -1.0000, 1.6797, 1.9688, -1.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -3.5469, -0.4180, 1.4062, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.9375, -4.6875, -1.0859, 0.9805, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7188, -1.9453, 2.1250, -0.8828, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9531, -1.9609, 1.5469, 0.8320, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:08:47,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:08:47,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.85 | bwd_microstep: 585.73 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 584.66 | step_microstep: 1.68 [2025-11-06 18:08:47,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.41 | bwd: 586.59 | bwd_inner: 1.74 | bwd_allreduce: 584.70 | step: 1.77 28%|██▊ | 976/3507 [24:01<45:49, 1.09s/it] {'loss': 0.2598, 'learning_rate': 1.694089660029504e-05, 'epoch': 0.28} 28%|██▊ | 976/3507 [24:01<45:49, 1.09s/it]tensor([[-4.7500, -3.5156, -0.3672, 1.3359, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8281, -2.0156, 1.0391, 0.5234, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:47,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.28 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.1250, -2.2188, 0.9414, 0.3242, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:1') tensor([[-4.2812, -2.1562, 0.2324, -2.0156, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.9062, -3.6875, 0.6133, 0.0203, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2812, -3.8906, -0.4453, 0.9297, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7500, -0.9102, 2.2812, -1.6562, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0000, -2.4062, 0.7773, 1.3203, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:08:48,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:08:48,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.11 | bwd_microstep: 970.27 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 969.27 | step_microstep: 2.20 [2025-11-06 18:08:48,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.42 | bwd: 971.04 | bwd_inner: 1.58 | bwd_allreduce: 969.31 | step: 2.28 28%|██▊ | 977/3507 [24:02<49:33, 1.18s/it] {'loss': 0.7593, 'learning_rate': 1.6934243841062767e-05, 'epoch': 0.28} 28%|██▊ | 977/3507 [24:02<49:33, 1.18s/it]tensor([[-3.7812, -2.4688, 0.4219, 1.8594, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:49,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 102.67 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.2500, -0.0518, 2.3125, -0.8633, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2656, -0.5156, 2.9531, -0.4297, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0469, -1.0312, 
1.3359, -1.1172, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6250, -3.4375, -0.4395, 0.9180, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1562, -4.1250, -1.0859, 1.1016, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5625, -4.2188, -0.8203, 0.5977, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -2.0156, 1.8125, 0.6367, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:08:49,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.77 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:08:49,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.33 | bwd_microstep: 297.56 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 296.55 | step_microstep: 2.30 [2025-11-06 18:08:49,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 304.02 | bwd: 298.46 | bwd_inner: 1.75 | bwd_allreduce: 296.58 | step: 2.37 28%|██▊ | 978/3507 [24:03<42:45, 1.01s/it] {'loss': 0.3707, 'learning_rate': 1.6927585165048604e-05, 'epoch': 0.28} 28%|██▊ | 978/3507 [24:03<42:45, 1.01s/it]tensor([[-7.7500, -6.5938, -2.9219, -0.5039, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8438, -1.4766, 1.7734, -0.4414, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3438, -4.3125, -1.0234, 1.7031, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5312, -4.0625, -1.9141, 1.0234, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8906, -1.8828, 1.1406, -0.2461, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-4.1562, -3.1250, -0.3164, 1.7109, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:50,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.84 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.9062, -3.3125, 1.2969, -0.3867, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, -2.0938, 1.1250, 0.3711, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:08:51,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:08:51,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.05 | bwd_microstep: 1.67 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.59 | step_microstep: 1.71 [2025-11-06 18:08:51,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.87 | bwd: 2.65 | bwd_inner: 1.90 | bwd_allreduce: 0.62 | step: 1.79 28%|██▊ | 979/3507 [24:04<48:47, 1.16s/it] {'loss': 0.3514, 'learning_rate': 1.6920920577934202e-05, 'epoch': 0.28} 28%|██▊ | 979/3507 [24:04<48:47, 1.16s/it]tensor([[ 0.0063, 2.4844, 4.5000, 1.0312, -0.2422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6094, -2.6094, 0.0908, 2.2344, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:51,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.82 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.7500, -2.7500, 0.1328, 2.7031, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9844, -2.7344, 0.4668, 2.2031, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') tensor([[-1.9062, 0.3379, 2.1719, -1.0156, -1.8203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8438, -3.4219, 1.0938, 0.0527, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4688, -3.1875, 0.1299, 1.9531, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2656, -0.4023, 3.0000, -0.7305, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:08:52,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.22 | optimizer_step: 0.23 [2025-11-06 18:08:52,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.00 | bwd_microstep: 552.62 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 551.42 | step_microstep: 2.08 [2025-11-06 18:08:52,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.84 | bwd: 553.41 | bwd_inner: 1.82 | bwd_allreduce: 551.46 | step: 2.15 28%|██▊ | 980/3507 [24:05<46:14, 1.10s/it] {'loss': 0.2938, 'learning_rate': 1.691425008540625e-05, 'epoch': 0.28} 28%|██▊ | 980/3507 [24:05<46:14, 1.10s/it]tensor([[-5.1875, -3.7500, -0.5898, 0.2539, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9531, -3.4375, -0.9805, 2.3125, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-9.5000, -7.7812, -3.3438, -2.3438, -7.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4062, 0.2754, 2.4531, -1.9609, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.7969, -2.2188, 0.6914, 0.9805, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:52,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 170.28 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.7812, -1.2266, 2.6094, 0.2734, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.8906, -2.8750, -0.1543, 1.8594, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0156, -1.0859, 1.7891, 0.4531, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:08:53,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.28 [2025-11-06 18:08:53,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 218.00 | bwd_microstep: 811.11 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 809.87 | step_microstep: 2.12 [2025-11-06 18:08:53,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 388.31 | bwd: 812.17 | bwd_inner: 2.08 | bwd_allreduce: 809.93 | step: 2.22 28%|██▊ | 981/3507 [24:07<55:45, 1.32s/it] {'loss': 0.8852, 'learning_rate': 1.690757369315648e-05, 'epoch': 0.28} 28%|██▊ | 981/3507 [24:07<55:45, 1.32s/it]tensor([[-4.4688, -2.5469, 1.0312, 0.8203, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7188, -4.4062, -0.7930, 1.1172, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:54,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.93 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.7812, -2.2031, 0.8516, 1.2031, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1875, -1.1953, 1.4688, 0.0405, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7344, -0.7227, 2.7812, -1.9922, 
-3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1875, -3.3750, 0.5781, 1.0312, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9375, -1.6953, 1.6953, -0.1006, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0000, -4.5625, -1.1016, 0.0688, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:08:54,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:08:54,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.19 | bwd_microstep: 280.18 | bwd_inner_microstep: 1.32 | bwd_allreduce_microstep: 278.77 | step_microstep: 1.53 [2025-11-06 18:08:54,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.14 | bwd: 281.14 | bwd_inner: 2.20 | bwd_allreduce: 278.80 | step: 1.61 28%|██▊ | 982/3507 [24:08<48:13, 1.15s/it] {'loss': 0.3579, 'learning_rate': 1.690089140688166e-05, 'epoch': 0.28} 28%|██▊ | 982/3507 [24:08<48:13, 1.15s/it]tensor([[-4.2500, -2.9375, 0.4590, 2.2812, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9844, -0.1387, 2.9062, -1.4375, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-1.5781, -0.9922, 0.9805, 3.7188, -0.4395]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8281, -0.8047, 2.2656, 1.1406, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8438, -3.2500, -0.9727, 1.7109, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:55,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.42 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | 
bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.4688, -3.3281, -1.6250, 1.9141, -1.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-2.5312, 0.2090, 3.1875, -0.6680, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.6562, -3.1094, -0.0850, 0.2227, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:08:56,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.18 | optimizer_step: 0.23 [2025-11-06 18:08:56,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.22 | bwd_microstep: 1418.76 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1417.57 | step_microstep: 2.18 [2025-11-06 18:08:56,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.65 | bwd: 1419.72 | bwd_inner: 1.98 | bwd_allreduce: 1417.61 | step: 2.26 28%|██▊ | 983/3507 [24:10<1:03:08, 1.50s/it] {'loss': 1.406, 'learning_rate': 1.689420323228358e-05, 'epoch': 0.28} 28%|██▊ | 983/3507 [24:10<1:03:08, 1.50s/it]tensor([[-0.0053, 0.9688, 3.0000, 4.2188, 0.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5312, -1.2344, 1.8828, -0.3203, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:08:57,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.35 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.8906, -1.2031, 1.2656, 0.5703, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1562, -2.7188, 0.5742, -1.8906, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6094, -0.8711, 2.5781, -0.9453, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:0') tensor([[-5.0938, -2.3594, 2.0469, -0.3359, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1406, -1.1250, 1.9062, 1.0156, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9688, -0.8086, 2.8906, -2.0156, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:08:57,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:08:57,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 76.55 | bwd_microstep: 267.82 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 266.72 | step_microstep: 1.49 [2025-11-06 18:08:57,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 273.92 | bwd: 268.85 | bwd_inner: 1.96 | bwd_allreduce: 266.76 | step: 1.58 28%|██▊ | 984/3507 [24:11<51:25, 1.22s/it] {'loss': 0.2971, 'learning_rate': 1.6887509175069057e-05, 'epoch': 0.28} 28%|██▊ | 984/3507 [24:11<51:25, 1.22s/it]tensor([[-4.0625, -3.7656, -1.5234, 2.3750, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5625, -4.9062, -0.6094, 0.7617, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0781, -1.7812, 0.9414, 2.0000, -1.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3438, -3.9844, -1.6094, 2.3594, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5625, -3.2812, -1.6953, 1.1250, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:58,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.40 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.6406, -3.5312, 
-2.1875, 0.6562, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-6.2188, -4.0000, -0.4727, -2.2344, -5.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3750, -1.7656, 1.6641, 2.4688, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:08:58,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 18:08:58,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.15 | bwd_microstep: 572.60 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 571.42 | step_microstep: 1.78 [2025-11-06 18:08:58,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.59 | bwd: 573.60 | bwd_inner: 2.00 | bwd_allreduce: 571.46 | step: 1.87 28%|██▊ | 985/3507 [24:12<54:06, 1.29s/it] {'loss': 0.8343, 'learning_rate': 1.6880809240949934e-05, 'epoch': 0.28} 28%|██▊ | 985/3507 [24:12<54:06, 1.29s/it]tensor([[-3.7188, -3.0000, -0.4258, 2.2969, -2.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.8750, -1.7031, -0.2871, 3.0156, -0.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -3.3125, -0.2197, 0.6133, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:08:59,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.17 | bwd_microstep: 1.12 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.6562, -3.1719, 0.4609, 2.0469, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0312, -1.2422, 2.2344, -1.3594, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3125, -3.5938, -0.0469, 0.2930, -3.8906]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8750, -3.7500, -0.7148, 1.1328, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0625, -1.7109, 1.5312, -0.3848, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:09:00,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:09:00,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.75 | bwd_microstep: 1417.11 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 1415.66 | step_microstep: 1.87 [2025-11-06 18:09:00,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.97 | bwd: 1418.22 | bwd_inner: 2.35 | bwd_allreduce: 1415.70 | step: 1.96 28%|██▊ | 986/3507 [24:14<1:00:26, 1.44s/it] {'loss': 0.2563, 'learning_rate': 1.687410343564306e-05, 'epoch': 0.28} 28%|██▊ | 986/3507 [24:14<1:00:26, 1.44s/it]tensor([[-2.8438, -2.0625, 0.0991, 2.1875, -1.6172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -1.5625, 2.2969, -0.1455, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9062, -4.9688, -1.7734, 1.2031, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9219, -1.7969, 2.0156, 1.2031, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:09:00,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.97 | bwd_microstep: 1.14 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.5000, -0.4727, 2.7500, -1.8438, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2188, -2.8281, 0.3770, 1.8984, -2.7812]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6875, -0.6602, 2.8125, -1.6719, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.9062, -4.5000, -1.2109, 0.1426, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:09:01,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:09:01,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.34 | bwd_microstep: 631.06 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 629.88 | step_microstep: 1.70 [2025-11-06 18:09:01,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 466.33 | bwd: 632.19 | bwd_inner: 2.14 | bwd_allreduce: 629.93 | step: 1.79 28%|██▊ | 987/3507 [24:15<56:40, 1.35s/it] {'loss': 0.2604, 'learning_rate': 1.68673917648703e-05, 'epoch': 0.28} 28%|██▊ | 987/3507 [24:15<56:40, 1.35s/it]tensor([[-4.6875, -2.3281, 1.5391, -0.2559, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3125, -2.5156, 0.0457, 2.6094, -1.8984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5938, -3.1875, 0.1445, 1.3438, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:09:02,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.31 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0625, -1.1641, 2.9219, -0.6289, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9219, -1.7344, 1.6328, 0.2109, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.7812, -2.3281, 1.6641, -0.1914, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') tensor([[-3.3750, -1.4531, 1.3906, 0.5430, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -3.2812, -0.4297, 1.4219, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:09:03,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:09:03,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.31 | bwd_microstep: 949.59 | bwd_inner_microstep: 1.34 | bwd_allreduce_microstep: 948.16 | step_microstep: 1.77 [2025-11-06 18:09:03,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 456.65 | bwd: 950.58 | bwd_inner: 2.24 | bwd_allreduce: 948.20 | step: 1.85 28%|██▊ | 988/3507 [24:17<57:52, 1.38s/it] {'loss': 0.2, 'learning_rate': 1.6860674234358517e-05, 'epoch': 0.28} 28%|██▊ | 988/3507 [24:17<57:52, 1.38s/it]tensor([[-2.6562, -0.0364, 2.7969, -1.1953, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.6250, -0.1133, 2.6562, -0.7500, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4688, -0.0160, 2.5312, -0.3691, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -2.8125, 0.3672, 1.4844, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9844, -2.5469, 0.7070, 1.7344, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5625, -2.6562, 0.1650, 2.4688, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:09:03,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.61 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.5312, -2.9062, 
0.4297, 1.1094, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2188, -2.1719, 1.8281, 1.7500, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:09:05,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.17 | optimizer_step: 0.24
[2025-11-06 18:09:05,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.95 | bwd_microstep: 998.45 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 997.23 | step_microstep: 1.89
[2025-11-06 18:09:05,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.58 | bwd: 999.42 | bwd_inner: 2.02 | bwd_allreduce: 997.27 | step: 1.97
28%|██▊ | 989/3507 [24:18<1:02:33, 1.49s/it] {'loss': 0.2712, 'learning_rate': 1.6853950849839582e-05, 'epoch': 0.28}
tensor([[-3.2188, -0.4590, 2.7812, -1.3828, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.6562, -2.8594, 0.4590, 0.3203, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9375, -2.5781, 0.5625, 1.8594, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:05,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.48 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.4375, -3.2188, 1.1016, 0.5273, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0625, -3.0156, 1.1094, 1.1797, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.8125, -3.4688, -0.3086, 0.9492, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.7812, -4.2188, -0.1602, 1.2891, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2031, -2.2656, -0.0187, 1.4297, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:09:05,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:09:05,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.52 | bwd_microstep: 107.96 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 106.80 | step_microstep: 1.46
[2025-11-06 18:09:05,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.03 | bwd: 108.93 | bwd_inner: 1.95 | bwd_allreduce: 106.85 | step: 1.55
28%|██▊ | 990/3507 [24:19<49:32, 1.18s/it] {'loss': 0.9919, 'learning_rate': 1.6847221617050354e-05, 'epoch': 0.28}
tensor([[-3.8594, -3.3594, -1.0312, 2.2656, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1250, -1.7109, 2.0469, -0.1621, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.5781, 0.6875, 2.5625, -0.8125, -1.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.3438, -2.3906, 0.3262, 2.3594, -1.9922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7812, -1.7969, 1.2500, -0.0047, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.2500, -2.1094, 1.1406, -0.4434, -3.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:06,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.47 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.1562, -3.2656, -0.4336, 1.8906, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.6250, -4.2812, -1.8750, 2.0625, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:09:07,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.17 | optimizer_step: 0.22
[2025-11-06 18:09:07,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.80 | bwd_microstep: 886.83 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 885.46 | step_microstep: 1.80
[2025-11-06 18:09:07,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.27 | bwd: 887.82 | bwd_inner: 2.18 | bwd_allreduce: 885.50 | step: 1.89
28%|██▊ | 991/3507 [24:21<1:00:06, 1.43s/it] {'loss': 0.8363, 'learning_rate': 1.6840486541732685e-05, 'epoch': 0.28}
tensor([[-3.0000, -0.4141, 2.2344, -1.2734, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.8281, -3.0781, -0.5625, 1.7109, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:07,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.81 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.39
tensor([[-3.8125, -1.8203, 1.4688, 0.5273, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0625, -3.5000, -1.0312, 2.1562, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4375, -3.1406, -1.2031, 2.0781, -1.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.6094, -2.4062, -0.8945, 2.3750, -1.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8750, -2.1562, 2.2344, -0.2910, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1250, -1.3203, 2.7500, -0.3789, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:08,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:09:08,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.44 | bwd_microstep: 2.26 | bwd_inner_microstep: 1.34 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.94
[2025-11-06 18:09:08,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 457.29 | bwd: 3.12 | bwd_inner: 2.09 | bwd_allreduce: 0.88 | step: 2.34
28%|██▊ | 992/3507 [24:21<48:22, 1.15s/it] {'loss': 0.8502, 'learning_rate': 1.6833745629633414e-05, 'epoch': 0.28}
tensor([[-3.1406, -0.7578, 2.3438, 0.2451, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0000, -2.3438, 1.7422, -0.8516, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:08,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.61 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.5000, -1.4141, 2.5312, -1.5938, -3.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3750, -2.0625, 1.8281, 0.5000, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.1094, -0.1934, 2.7969, -1.8828, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.2969, 0.2754, 2.9219, -0.3945, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-5.5312, -3.8594, -0.2832, 0.2266, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2500, -3.0781, 0.6992, -0.3867, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:09:10,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:09:10,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.38 | bwd_microstep: 1989.68 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 1988.44 | step_microstep: 2.08
[2025-11-06 18:09:10,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 414.02 | bwd: 1990.67 | bwd_inner: 2.05 | bwd_allreduce: 1988.48 | step: 2.16
28%|██▊ | 993/3507 [24:24<1:04:33, 1.54s/it] {'loss': 0.6588, 'learning_rate': 1.682699888650436e-05, 'epoch': 0.28}
tensor([[-5.4688, -3.5156, 0.4414, 0.6445, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:10,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.44 | bwd_microstep: 1.41 | bwd_inner_microstep: 1.32 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.5000, -3.2656, -0.1953, 1.2969, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9375, -2.1562, 0.8125, 0.4629, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9688, -3.0938, 0.1895, -0.5742, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.2500, -0.4375, 3.1719, -0.2500, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3750, -5.0000, -2.6875, 0.7266, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.9844, -1.7969, 1.1953, 2.7656, -1.7891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0312, -2.2656, 0.6562, -3.0938, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:09:11,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:09:11,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.31 | bwd_microstep: 208.07 | bwd_inner_microstep: 1.47 | bwd_allreduce_microstep: 206.52 | step_microstep: 1.48
[2025-11-06 18:09:11,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.77 | bwd: 209.48 | bwd_inner: 2.81 | bwd_allreduce: 206.55 | step: 1.55
28%|██▊ | 994/3507 [24:24<52:16, 1.25s/it] {'loss': 0.3248, 'learning_rate': 1.682024631810231e-05, 'epoch': 0.28}
tensor([[-4.8125, -3.3438, -0.0102, 1.2109, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:11,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.31 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.3438, -1.9219, -0.2432, 2.5781, -1.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6562, -3.0625, 0.5977, 1.6875, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.2188, -4.7500, -1.2500, -0.0586, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.5625, -3.6875, 0.1426, 0.3105, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9062, -3.2812, 0.5078, 1.7031, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.8906, -0.4570, 2.2188, -0.7031, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9688, -3.6875, -0.2500, 1.8359, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:09:13,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:09:13,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.29 | bwd_microstep: 2271.56 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 2270.16 | step_microstep: 2.07
[2025-11-06 18:09:13,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 279.62 | bwd: 2272.58 | bwd_inner: 2.22 | bwd_allreduce: 2270.21 | step: 2.15
28%|██▊ | 995/3507 [24:27<1:09:02, 1.65s/it] {'loss': 0.4144, 'learning_rate': 1.681348793018904e-05, 'epoch': 0.28}
tensor([[-4.6250, -2.3438, 0.8125, -0.9883, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.6406, -2.4219, 0.1660, 1.8047, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.9688, -0.9141, 1.4297, 2.7500, -1.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:13,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.12 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.6875, -3.3594, 0.9453, 0.0170, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7969, -2.2500, -0.0457, 3.0156, -1.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3438, -3.1875, -0.2578, 1.5469, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0625, -2.0781, 1.3281, 0.7578, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.7812, -3.8594, -0.6445, 2.2812, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:09:14,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:09:14,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.10 | bwd_microstep: 66.48 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 65.50 | step_microstep: 1.48
[2025-11-06 18:09:14,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.24 | bwd: 67.54 | bwd_inner: 1.88 | bwd_allreduce: 65.53 | step: 1.56
28%|██▊ | 996/3507 [24:27<54:19, 1.30s/it] {'loss': 0.2269, 'learning_rate': 1.680672372853126e-05, 'epoch': 0.28}
tensor([[-1.6797, 0.3613, 2.1406, -0.1206, -1.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-4.0312, -2.5781, 0.5117, 1.3203, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:14,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.58 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.7031, -3.5156, -1.7109, 1.8984, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-3.8750, -3.3125, -0.8398, 2.5156, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.1094, -1.5078, 1.4531, 2.0000, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.4688, -4.0000, -0.2891, 1.2891, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9375, -2.4062, 1.6562, -0.5703, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4688, -4.0000, -0.4297, 0.6758, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:09:14,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.15 | optimizer_step: 0.26
[2025-11-06 18:09:14,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.83 | bwd_microstep: 47.52 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 46.31 | step_microstep: 1.79
[2025-11-06 18:09:14,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.44 | bwd: 48.45 | bwd_inner: 1.98 | bwd_allreduce: 46.34 | step: 1.87
28%|██▊ | 997/3507 [24:28<43:20, 1.04s/it] {'loss': 0.9505, 'learning_rate': 1.679995371890068e-05, 'epoch': 0.28}
tensor([[-2.3594, -2.4688, -1.1953, 2.7344, -0.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.1875, -2.0938, 1.0391, -0.4043, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.3438, -2.5312, -0.1157, 2.3125, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-4.4062, -3.8281, -1.1953, 2.0156, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.6250, -3.4688, 0.5664, -0.1914, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:14,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.21 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.4375, -3.0000, -0.6250, 2.9844, -1.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4375, -2.8750, -0.5586, 2.5625, -1.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7500, -1.6562, 1.0469, -0.6484, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:15,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:09:15,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.07 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.87 | step_microstep: 1.54
[2025-11-06 18:09:15,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 488.30 | bwd: 3.03 | bwd_inner: 2.00 | bwd_allreduce: 0.90 | step: 1.62
28%|██▊ | 998/3507 [24:28<37:00, 1.13it/s] {'loss': 0.6899, 'learning_rate': 1.6793177907073937e-05, 'epoch': 0.28}
tensor([[-5.7812, -4.4062, -0.6094, 1.4375, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.6562, -0.2012, 2.1406, -0.8711, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.1250, -4.7500, -2.2031, 1.4922, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:15,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.07 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.8906, -3.0469, -0.4668, 2.0312, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8750, -0.5820, 2.0938, -0.4590, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.8594, -0.4766, 2.2500, -0.4805, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.7812, -3.5312, -0.0229, 2.2500, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9688, -3.5938, -0.1865, 1.0391, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:09:16,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:09:16,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.52 | bwd_microstep: 1226.34 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1225.12 | step_microstep: 1.89
[2025-11-06 18:09:16,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.61 | bwd: 1227.31 | bwd_inner: 2.02 | bwd_allreduce: 1225.16 | step: 1.98
28%|██▊ | 999/3507 [24:30<45:39, 1.09s/it] {'loss': 0.4291, 'learning_rate': 1.6786396298832622e-05, 'epoch': 0.28}
tensor([[-3.5938, -3.1406, -0.7383, 2.7969, -1.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9062, -2.2031, 1.9453, -0.3281, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:17,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.39 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.5938, -3.9219, -0.1875, 0.3301, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.8438, -4.2188, -0.8438, -0.2451, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2812, -2.0625, 1.2969, -0.2734, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.3438, -0.8555, 2.3125, -0.2344, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.6875, 0.5352, 2.8906, 1.0156, -1.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2188, -3.2812, -0.6406, 1.3672, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:09:18,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:09:18,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.56 | bwd_microstep: 754.96 | bwd_inner_microstep: 1.58 | bwd_allreduce_microstep: 753.29 | step_microstep: 1.87
[2025-11-06 18:09:18,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.97 | bwd: 755.82 | bwd_inner: 2.35 | bwd_allreduce: 753.33 | step: 1.96
29%|██▊ | 1000/3507 [24:32<51:01, 1.22s/it] {'loss': 0.2953, 'learning_rate': 1.677960889996329e-05, 'epoch': 0.29}
tensor([[-4.5000, -3.2500, -0.2695, 1.1641, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:18,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.71 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-2.7812, -1.9219, 0.2383, 2.1406, -1.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5156, -0.9883, 2.3125, -0.2930, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7656, -1.0547, 2.5938, -0.2578, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7812, -1.3984, 2.1875, 0.3965, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.9062, -1.5781, 1.8906, 0.2480, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.4375, -0.8906, 2.2344, -0.5195, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0625, -1.4297, 2.1719, -0.7109, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:09:19,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:09:19,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.93 | bwd_microstep: 1097.43 | bwd_inner_microstep: 1.46 | bwd_allreduce_microstep: 1095.88 | step_microstep: 1.76
[2025-11-06 18:09:19,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 292.67 | bwd: 1098.25 | bwd_inner: 2.18 | bwd_allreduce: 1095.92 | step: 1.86
29%|██▊ | 1001/3507 [24:33<53:32, 1.28s/it] {'loss': 0.5114, 'learning_rate': 1.6772815716257414e-05, 'epoch': 0.29}
tensor([[0.9727, 3.2969, 4.4375, 0.4766, 0.3535]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.0000, -3.5938, -1.3828, 2.0312, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:19,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.70 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.0312, -2.3438, -0.2812, 1.9375, -1.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.5938, 0.0171, 2.2812, -1.7109, -2.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.9375, -2.2812, 1.8828, -0.6016, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.5312, -0.5703, 1.8594, 0.0884, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7969, -2.1562, 1.1875, 2.0781, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4375, -1.6094, 1.3125, 0.5664, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:09:21,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:09:21,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.60 | bwd_microstep: 1685.64 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1684.53 | step_microstep: 1.80
[2025-11-06 18:09:21,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.33 | bwd: 1686.48 | bwd_inner: 1.78 | bwd_allreduce: 1684.57 | step: 1.88
29%|██▊ | 1002/3507 [24:35<1:03:30, 1.52s/it] {'loss': 0.4798, 'learning_rate': 1.6766016753511415e-05, 'epoch': 0.29}
tensor([[-3.1562, -2.3125, 0.0294, 2.2344, -1.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.1094, 0.0908, 2.2656, -0.2793, -1.8672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:21,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.70 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06
tensor([[-5.0625, -2.5312, 1.4453, -0.3887, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1250, -2.8281, 0.1924, 1.7656, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4688, -1.6250, 1.6641, 1.6094, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.2812, -2.4062, 1.0391, 0.7148, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4062, -2.8594, 0.5938, 1.5234, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2812, -2.9375, 1.3438, 0.4961, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:09:22,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:09:22,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.60 | bwd_microstep: 708.11 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 707.07 | step_microstep: 1.64
[2025-11-06 18:09:22,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 285.33 | bwd: 708.87 | bwd_inner: 1.65 | bwd_allreduce: 707.10 | step: 1.71
29%|██▊ | 1003/3507 [24:36<57:14, 1.37s/it] {'loss': 0.6403, 'learning_rate': 1.675921201752665e-05, 'epoch': 0.29}
tensor([[-3.6562, -2.9844, -0.4629, 2.7031, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3125, -4.0000, -1.7734, 1.6953, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.4531, -2.3594, 0.4316, 2.1094, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:22,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.07 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.4844, -0.8203, 2.1094, -1.5312, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4688, -1.6250, 1.2031, 0.1245, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3438, -3.7031, -0.3516, -0.0112, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.5000, -3.7031, -0.9609, 1.6328, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0000, -2.8906, 0.0217, 1.8906, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:09:24,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:09:24,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.69 | bwd_microstep: 1302.96 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1301.81 | step_microstep: 1.97
[2025-11-06 18:09:24,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.78 | bwd: 1303.74 | bwd_inner: 1.77 | bwd_allreduce: 1301.85 | step: 2.04
29%|██▊ | 1004/3507 [24:38<1:01:27, 1.47s/it] {'loss': 0.2375, 'learning_rate': 1.675240151410939e-05, 'epoch': 0.29}
tensor([[-4.8125, -3.9844, -1.1953, 1.4688, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3125, -3.9844, -0.3770, 1.7266, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8438, -1.9531, 1.6016, -2.2812, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7031, -0.0942, 2.7812, -0.3164, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:24,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.17 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.1562, -2.9219, 0.1338, 2.0469, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.6875, -3.4688, -0.2598, 1.6406, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5781, -2.1562, 1.0547, 2.4062, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[0.3340, 0.8438, 2.6719, 6.0312, 1.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:09:24,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.15 | optimizer_step: 0.19
[2025-11-06 18:09:24,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.11 | bwd_microstep: 20.10 | bwd_inner_microstep: 1.48 | bwd_allreduce_microstep: 18.52 | step_microstep: 2.05
[2025-11-06 18:09:24,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.46 | bwd: 20.90 | bwd_inner: 2.20 | bwd_allreduce: 18.55 | step: 2.12
29%|██▊ | 1005/3507 [24:38<48:13, 1.16s/it] {'loss': 0.529, 'learning_rate': 1.6745585249070834e-05, 'epoch': 0.29}
tensor([[-5.1250, -3.7656, -0.0238, 2.0469, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4688, -2.6406, 1.0156, 1.4297, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:25,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.06 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.4688, -2.3594, 1.4922, 0.8359, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.0000, -3.2031, 0.2275, -3.0469, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.5000, -0.7266, 2.2031, -1.3594, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-3.8125, -1.6250, 2.1719, 1.6328, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([3], device='cuda:0')
tensor([[-5.0625, -3.8438, -0.6914, 0.8945, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0312, -4.5625, -1.0312, -0.0233, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:09:26,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.22 | optimizer_step: 0.31
[2025-11-06 18:09:26,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.52 | bwd_microstep: 1214.76 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1213.65 | step_microstep: 2.34
[2025-11-06 18:09:26,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.61 | bwd: 1215.55 | bwd_inner: 1.72 | bwd_allreduce: 1213.70 | step: 2.42
29%|██▊ | 1006/3507 [24:40<54:00, 1.30s/it] {'loss': 0.4758, 'learning_rate': 1.6738763228227094e-05, 'epoch': 0.29}
tensor([[-1.7656, 0.4902, 2.5938, -0.3477, -1.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0938, -2.5625, 0.6680, 1.5625, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2500, -4.4062, -1.5234, 1.0625, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:26,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.30 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.8438, -1.6094, 1.9531, 0.6641, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9688, -1.0547, 2.1406, -1.9531, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1562, -2.3281, 0.6992, 0.1836, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.5000, -2.2500, 0.7148, 2.3438, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1250, -2.2031, 2.0781, -0.9766, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:09:27,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:09:27,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.16 | bwd_microstep: 1071.55 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 1070.24 | step_microstep: 1.63
[2025-11-06 18:09:27,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.48 | bwd: 1072.63 | bwd_inner: 2.22 | bwd_allreduce: 1070.28 | step: 1.73
29%|██▊ | 1007/3507 [24:41<55:55, 1.34s/it] {'loss': 0.3796, 'learning_rate': 1.6731935457399205e-05, 'epoch': 0.29}
tensor([[-4.4062, -3.2812, -0.2363, 1.7891, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:28,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.33 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.19
tensor([[-6.1562, -5.3125, -2.1719, 0.7109, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.8594, -2.1094, 1.3516, 1.8828, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.7266, 1.0234, 3.4219, -1.1484, -1.8984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.2188, -2.0312, 0.4336, 1.5312, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3125, -1.7578, 1.5938, -0.9023, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3750, -3.4375, -0.6484, 1.8594, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6562, 0.2852, 3.2656, -0.9258, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:09:29,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:09:29,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.41 | bwd_microstep: 1081.85 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1080.58 | step_microstep: 1.85
[2025-11-06 18:09:29,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.76 | bwd: 1082.86 | bwd_inner: 2.11 | bwd_allreduce: 1080.63 | step: 2.04
29%|██▊ | 1008/3507 [24:43<57:26, 1.38s/it] {'loss': 0.1761, 'learning_rate': 1.6725101942413085e-05, 'epoch': 0.29}
tensor([[-4.2188, -1.6250, 2.2969, 0.1211, -3.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:29,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.15 | bwd_microstep: 1.10 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.8125, -3.8281, -0.7500, 1.7422, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.9375, -1.5391, 2.1406, 0.3242, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.3750, -3.1094, -1.0938, 2.7031, -1.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.4766, 1.4141, 4.0312, -0.1963, -1.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.0781, -2.2500, 0.0840, 2.0781, -1.8359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3750, -1.6016, 2.2188, -0.7539, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1875, -2.4219, 0.9062, 1.1094, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:09:30,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:09:30,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.37 | bwd_microstep: 440.28 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 438.99 | step_microstep: 1.82
[2025-11-06 18:09:30,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 262.54 | bwd: 441.37 | bwd_inner: 2.23 | bwd_allreduce: 439.02 | step: 1.89
29%|██▉ | 1009/3507 [24:43<49:20, 1.19s/it] {'loss': 0.4259, 'learning_rate': 1.6718262689099577e-05, 'epoch': 0.29}
tensor([[-6.4375, -4.4375, -1.1094, -2.0938, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8125, -3.3750, -0.0425, 1.2656, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:30,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.02 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.0625, -1.4297, 1.9375, -0.9414, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7812, -1.2734, 2.0000, -0.5469, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.9531, -3.5156, -1.2266, 2.0469, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.9375, -0.5039, 2.0312, -0.6328, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1250, -3.0000, 1.2969, 1.1719, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0000, -4.4375, -1.7578, 1.5078, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:09:31,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:09:31,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.75 | bwd_microstep: 1400.86 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1399.59 | step_microstep: 2.02
[2025-11-06 18:09:31,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.80 | bwd: 1401.86 | bwd_inner: 2.08 | bwd_allreduce:
1399.63 | step: 2.11 29%|██▉ | 1010/3507 [24:45<56:44, 1.36s/it] {'loss': 0.232, 'learning_rate': 1.6711417703294404e-05, 'epoch': 0.29} 29%|██▉ | 1010/3507 [24:45<56:44, 1.36s/it]tensor([[-3.6250, -0.9414, 1.9922, -1.4844, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:09:32,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.59 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.1250, -3.9219, -2.1094, 1.3047, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1719, -2.0625, 0.5625, 1.9844, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5312, -2.7812, 0.5547, 0.6289, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8438, -2.7500, 1.1719, 0.7891, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0938, -3.8125, -1.6719, 1.7812, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -3.2656, -0.1338, 1.0312, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2656, -1.3125, 1.2266, -0.1230, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:09:32,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:09:32,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.20 | bwd_microstep: 560.25 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 559.26 | step_microstep: 1.69 [2025-11-06 18:09:32,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 251.82 | bwd: 561.09 | bwd_inner: 1.68 | bwd_allreduce: 559.29 | step: 1.76 29%|██▉ | 1011/3507 
[24:46<50:11, 1.21s/it] {'loss': 0.296, 'learning_rate': 1.6704566990838192e-05, 'epoch': 0.29} 29%|██▉ | 1011/3507 [24:46<50:11, 1.21s/it]tensor([[-3.7812, -1.4219, 1.9609, 0.1104, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:09:32,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.51 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5000, -1.5859, 1.4609, 0.9141, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6719, -0.9727, 2.6094, -0.4395, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9688, -3.4375, 0.5117, -1.2188, -4.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.1953, 1.2109, 3.4531, -0.0530, -1.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.7344, 1.0781, 3.4688, 2.5938, -0.4102]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.0156, -1.5625, 0.8047, 0.7227, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0000, -4.0000, -0.8398, 1.4062, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:09:35,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:09:35,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.05 | bwd_microstep: 1750.94 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 1749.98 | step_microstep: 2.01 [2025-11-06 18:09:35,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.57 | bwd: 1751.86 | bwd_inner: 1.69 | bwd_allreduce: 1750.02 | step: 2.10 29%|██▉ | 1012/3507 [24:48<1:03:32, 1.53s/it] {'loss': 0.7192, 
'learning_rate': 1.6697710557576448e-05, 'epoch': 0.29} 29%|██▉ | 1012/3507 [24:48<1:03:32, 1.53s/it]tensor([[-4.7500, -3.4844, -0.2109, 1.5547, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-2.9688, -0.5625, 2.3594, -0.1631, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([2], device='cuda:2') tensor([[-4.4062, -3.5469, -0.9023, 1.1250, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9219, -0.1416, 2.4688, -1.3750, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1250, -2.8750, 0.0220, 1.3594, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:09:35,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.18 | bwd_microstep: 1.13 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2188, -2.1250, 0.9883, -0.3418, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0000, -3.1250, -0.1855, -1.0234, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1562, -3.6562, -0.5859, -0.1104, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:09:36,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.18 | optimizer_step: 0.26 [2025-11-06 18:09:36,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.25 | bwd_microstep: 895.84 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 894.53 | step_microstep: 1.96 [2025-11-06 18:09:36,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 426.45 | bwd: 896.97 | bwd_inner: 2.27 | bwd_allreduce: 894.57 | step: 2.04 29%|██▉ | 1013/3507 [24:50<1:01:26, 1.48s/it] {'loss': 0.36, 'learning_rate': 
1.6690848409359555e-05, 'epoch': 0.29} 29%|██▉ | 1013/3507 [24:50<1:01:26, 1.48s/it]tensor([[-1.7734, 0.6367, 2.5469, -0.9336, -1.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6250, -2.7656, 0.6484, 0.2500, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:09:36,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.70 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.3750, -5.0938, -2.7188, 1.0703, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5469, -0.8242, 2.4531, -0.8438, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9531, -2.6562, 0.2500, 1.3359, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0625, -2.6250, 0.8047, 2.0938, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1562, -3.4844, -0.9336, 1.9688, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -2.8438, 0.3730, -0.5625, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:09:37,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:09:37,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.88 | bwd_microstep: 1236.73 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1235.52 | step_microstep: 1.57 [2025-11-06 18:09:37,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.60 | bwd: 1237.71 | bwd_inner: 2.01 | bwd_allreduce: 1235.57 | step: 1.66 29%|██▉ | 1014/3507 [24:51<1:02:56, 1.51s/it] {'loss': 0.234, 'learning_rate': 1.668398055204278e-05, 'epoch': 
0.29} 29%|██▉ | 1014/3507 [24:51<1:02:56, 1.51s/it]tensor([[-3.3750, -3.3125, -1.4297, 2.7188, -1.6484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:09:38,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.18 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.20 tensor([[-4.0625, -3.7812, -1.6094, 1.9844, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8750, -4.6875, -1.1641, 1.1016, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9062, 0.0908, 3.0156, -1.4141, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.4688, -2.0938, 0.1855, -2.4844, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.6875, -4.1875, -1.6172, 1.6953, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6094, -0.7266, 2.3281, -1.6719, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -2.9531, -0.1514, 0.9766, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:09:39,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:09:39,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.55 | bwd_microstep: 790.96 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 789.83 | step_microstep: 1.83 [2025-11-06 18:09:39,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 276.75 | bwd: 791.97 | bwd_inner: 1.96 | bwd_allreduce: 789.86 | step: 2.03 29%|██▉ | 1015/3507 [24:52<57:45, 1.39s/it] {'loss': 0.761, 'learning_rate': 1.6677106991486264e-05, 'epoch': 0.29} 29%|██▉ | 1015/3507 [24:52<57:45, 
1.39s/it]tensor([[-4.0312, -3.6875, -1.4922, 1.9766, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:09:39,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.55 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.9688, -4.8125, -2.6562, 1.1484, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-3.5625, -1.5781, 1.1641, 0.0136, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0625, -2.3594, 1.8281, -0.3828, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2188, -2.2656, 0.8672, -0.0400, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2812, -5.1250, -2.7188, 1.5547, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2188, -2.0312, 1.4531, 0.5000, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1562, -0.7070, 1.4609, -1.6719, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 18:09:40,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:09:40,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.66 | bwd_microstep: 1259.70 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1258.62 | step_microstep: 1.83 [2025-11-06 18:09:40,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.18 | bwd: 1260.58 | bwd_inner: 1.80 | bwd_allreduce: 1258.65 | step: 1.92 29%|██▉ | 1016/3507 [24:54<1:01:07, 1.47s/it] {'loss': 1.0939, 'learning_rate': 1.6670227733555004e-05, 'epoch': 0.29} 29%|██▉ | 1016/3507 [24:54<1:01:07, 1.47s/it]tensor([[-4.5312, 
-2.2969, 1.2891, 0.2500, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:09:40,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.20 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.7188, -3.4219, -0.1924, 1.4141, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.6406, 0.1128, 2.9375, 2.7031, -0.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6562, -4.0312, -0.0272, 1.4766, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7812, -3.6250, -1.7578, 1.9297, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0000, -1.1875, 2.5312, -0.6719, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4688, -1.6406, 2.4688, -0.6133, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -3.6406, -0.4902, 1.8828, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:09:41,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.15 | optimizer_step: 0.19 [2025-11-06 18:09:41,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.09 | bwd_microstep: 198.36 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 197.24 | step_microstep: 1.67 [2025-11-06 18:09:41,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.32 | bwd: 199.28 | bwd_inner: 1.88 | bwd_allreduce: 197.28 | step: 1.75 29%|██▉ | 1017/3507 [24:55<49:22, 1.19s/it] {'loss': 0.4236, 'learning_rate': 1.6663342784118865e-05, 'epoch': 0.29} 29%|██▉ | 1017/3507 [24:55<49:22, 1.19s/it]tensor([[-5.3125, -3.4844, 0.3145, 0.6523, -3.8750]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5469, -3.0625, -0.5234, 3.2656, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:09:41,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.04 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.1250, -4.1875, -1.0547, 1.6094, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8281, -2.0000, 1.2969, 1.2266, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2500, -3.9375, -1.6797, 2.0781, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8125, 0.1660, 3.5000, -0.1895, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.0312, 0.1172, 3.1875, -1.3828, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4219, -0.1914, 1.9766, -0.6406, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:09:43,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.74 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:09:43,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.12 | bwd_microstep: 1970.99 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1969.80 | step_microstep: 2.72 [2025-11-06 18:09:43,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 456.19 | bwd: 1971.85 | bwd_inner: 1.88 | bwd_allreduce: 1969.84 | step: 2.79 29%|██▉ | 1018/3507 [24:57<1:05:16, 1.57s/it] {'loss': 0.2222, 'learning_rate': 1.6656452149052568e-05, 'epoch': 0.29} 29%|██▉ | 1018/3507 [24:57<1:05:16, 1.57s/it]tensor([[-2.5156, -2.2500, -0.3047, 3.2031, -1.1016]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5000, -3.5156, -0.5625, 1.7188, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8438, -4.1250, -0.7812, -0.8008, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:09:43,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.07 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.3125, -3.1250, 1.0312, 0.4082, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4062, -2.2031, 1.3047, -0.1963, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6875, -2.3125, 1.5156, -0.0342, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1406, -1.7266, 1.1797, 2.1250, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.1406, 0.4531, 2.9844, -1.0781, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:09:44,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:09:44,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.20 | bwd_microstep: 41.79 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 40.52 | step_microstep: 1.49 [2025-11-06 18:09:44,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.29 | bwd: 42.61 | bwd_inner: 1.93 | bwd_allreduce: 40.55 | step: 1.56 29%|██▉ | 1019/3507 [24:58<51:06, 1.23s/it] {'loss': 0.4006, 'learning_rate': 1.6649555834235686e-05, 'epoch': 0.29} 29%|██▉ | 1019/3507 [24:58<51:06, 1.23s/it]tensor([[-1.9219, 0.9961, 3.7188, -0.5508, -1.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') tensor([[-3.9062, -2.8594, 0.0757, 2.4688, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.6953, 0.6680, 2.9375, 0.5977, -1.4766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:09:44,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.83 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10 tensor([[-3.7188, -2.7188, -0.1572, 1.7656, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4062, -1.9297, 1.6016, -0.6484, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4688, -2.9688, 1.3828, -0.0583, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6875, -3.8125, 0.2793, 0.9688, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1094, -2.9844, -1.2344, 2.3281, -1.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:09:46,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.20 | optimizer_step: 0.31 [2025-11-06 18:09:46,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.45 | bwd_microstep: 1456.02 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1454.98 | step_microstep: 2.22 [2025-11-06 18:09:46,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.30 | bwd: 1456.80 | bwd_inner: 1.62 | bwd_allreduce: 1455.04 | step: 2.32 29%|██▉ | 1020/3507 [24:59<58:51, 1.42s/it] {'loss': 0.3524, 'learning_rate': 1.6642653845552643e-05, 'epoch': 0.29} 29%|██▉ | 1020/3507 [24:59<58:51, 1.42s/it]tensor([[-7.4375, -5.0000, -0.8906, -2.5625, -5.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8438, 
-4.8438, -1.6406, 0.7266, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:09:46,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.36 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.3438, -2.6562, 0.7461, 1.3125, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7188, -3.5312, 0.4961, -0.0854, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6719, -0.7344, 2.9219, -0.9883, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3750, -3.2500, 0.9062, 0.8008, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7500, -2.7500, 1.0938, 0.8984, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4062, -1.0391, 1.8750, -0.3809, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:09:46,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:09:46,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.06 | bwd_microstep: 245.34 | bwd_inner_microstep: 1.50 | bwd_allreduce_microstep: 243.75 | step_microstep: 1.62 [2025-11-06 18:09:46,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.44 | bwd: 246.28 | bwd_inner: 2.37 | bwd_allreduce: 243.79 | step: 1.69 29%|██▉ | 1021/3507 [25:00<48:38, 1.17s/it] {'loss': 0.3761, 'learning_rate': 1.663574618889268e-05, 'epoch': 0.29} 29%|██▉ | 1021/3507 [25:00<48:38, 1.17s/it]tensor([[-3.8906, -3.9062, -2.4844, 0.9688, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:09:46,811] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 136.28 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.1406, -2.6875, -0.6250, 2.3125, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6719, -0.8398, 2.4531, 2.4688, -1.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4375, -1.9219, 1.9453, 0.0649, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7812, -3.5000, -0.1562, 1.8281, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3125, -4.0312, -2.0469, 1.4844, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.1719, -2.7031, -0.6133, 2.4844, -1.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -3.7500, -0.8750, 1.0312, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:09:49,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:09:49,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.31 | bwd_microstep: 2541.28 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 2540.28 | step_microstep: 2.21 [2025-11-06 18:09:49,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.61 | bwd: 2542.28 | bwd_inner: 1.83 | bwd_allreduce: 2540.32 | step: 2.29 29%|██▉ | 1022/3507 [25:03<1:10:47, 1.71s/it] {'loss': 0.185, 'learning_rate': 1.6628832870149913e-05, 'epoch': 0.29} 29%|██▉ | 1022/3507 [25:03<1:10:47, 1.71s/it]tensor([[-3.8125, -2.9844, -0.0884, 2.9219, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:09:49,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 152.74 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-12.0625, -10.8750, -6.1562, -3.0625, -8.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8438, -2.1562, 1.3516, 2.0469, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -2.3750, 1.3516, 0.5391, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4062, -3.1250, -1.6016, 1.2344, -1.9141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -2.4844, 1.4922, 0.9883, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4688, -3.0781, 0.1357, 1.4688, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, -3.4375, -0.4336, 2.1250, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:09:50,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:09:50,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 210.98 | bwd_microstep: 48.27 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 47.10 | step_microstep: 1.41 [2025-11-06 18:09:50,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.75 | bwd: 49.14 | bwd_inner: 1.88 | bwd_allreduce: 47.13 | step: 1.48 29%|██▉ | 1023/3507 [25:03<55:05, 1.33s/it] {'loss': 0.296, 'learning_rate': 1.662191389522326e-05, 'epoch': 0.29} 29%|██▉ | 1023/3507 [25:03<55:05, 1.33s/it]tensor([[-4.5938, -2.7500, 0.5625, 0.2188, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:09:50,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.84 | bwd_microstep: 0.89 | 
bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.3125, -4.5000, -1.4531, 1.7734, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4688, -1.7734, 2.0000, -0.9102, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.1562, -3.1250, 1.0000, 0.8867, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.1250, -1.2812, 0.9805, 3.2188, -0.9961]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.4688, -0.6211, 2.2656, 1.9922, -1.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6875, -3.6719, -0.7891, 1.2578, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2812, -2.1406, 2.2969, -1.3125, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:09:51,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.77 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:09:51,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.63 | bwd_microstep: 1575.77 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1574.66 | step_microstep: 2.22
[2025-11-06 18:09:51,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 309.50 | bwd: 1576.66 | bwd_inner: 1.83 | bwd_allreduce: 1574.70 | step: 2.30
29%|██▉ | 1024/3507 [25:05<1:02:22, 1.51s/it] {'loss': 0.2839, 'learning_rate': 1.6614989270016474e-05, 'epoch': 0.29}
tensor([[-4.4375, -3.6250, -0.8320, 1.8047, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8438, -3.5781, -1.4141, 2.2188, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6250, -2.9688, 0.1289, 0.4102, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3438, -3.1875, 1.0234, 0.6992, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:52,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 231.97 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-1.9609, 0.5352, 2.5625, -1.3516, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6250, -0.6172, 1.9688, 0.6836, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.7500, -4.8750, -0.4551, 0.5000, -4.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.0000, -3.3438, -0.6602, 2.4688, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:09:52,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 18:09:52,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.85 | bwd_microstep: 28.18 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 27.05 | step_microstep: 1.49
[2025-11-06 18:09:52,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.83 | bwd: 29.00 | bwd_inner: 1.80 | bwd_allreduce: 27.08 | step: 1.56
29%|██▉ | 1025/3507 [25:06<49:00, 1.18s/it] {'loss': 0.269, 'learning_rate': 1.660805900043813e-05, 'epoch': 0.29}
tensor([[-4.8750, -3.6875, -0.6406, 0.9805, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.1250, 0.1455, 2.5000, 0.1270, -1.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6875, -3.2969, -1.3281, 1.7109, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2812, -1.3828, 2.4375, -0.8828, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:52,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 281.14 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.5312, -2.4531, 1.1875, 0.3828, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.6562, -3.3594, 0.2236, -1.0859, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.0000, -2.1406, 2.2812, -0.3828, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7969, -1.3906, 1.8438, 0.3613, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:09:54,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:09:54,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.86 | bwd_microstep: 1383.69 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 1382.26 | step_microstep: 1.68
[2025-11-06 18:09:54,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 425.03 | bwd: 1384.61 | bwd_inner: 2.17 | bwd_allreduce: 1382.29 | step: 1.74
29%|██▉ | 1026/3507 [25:08<57:13, 1.38s/it] {'loss': 0.1902, 'learning_rate': 1.6601123092401624e-05, 'epoch': 0.29}
tensor([[-3.1250, -0.6758, 1.4688, -1.8672, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7812, -3.6250, -1.4922, 2.7031, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7812, -4.2500, -1.7656, 1.4141, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2344, -0.8750, 1.8828, -0.4902, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:54,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.13 | bwd_microstep: 1.17 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.2812, -2.1250, 1.3672, 0.4238, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8594, -2.9375, 0.0483, 2.9062, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5000, -0.8008, 2.1875, -1.3047, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.0938, -2.9375, -0.1196, 1.4453, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:09:54,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.20
[2025-11-06 18:09:54,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.56 | bwd_microstep: 2.20 | bwd_inner_microstep: 1.39 | bwd_allreduce_microstep: 0.73 | step_microstep: 1.61
[2025-11-06 18:09:54,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.71 | bwd: 3.36 | bwd_inner: 2.47 | bwd_allreduce: 0.76 | step: 1.69
29%|██▉ | 1027/3507 [25:08<44:51, 1.09s/it] {'loss': 0.3372, 'learning_rate': 1.659418155182515e-05, 'epoch': 0.29}
tensor([[-4.2500, -2.2656, 1.2578, 0.8477, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.6719, -3.0938, -0.9102, 1.7031, -2.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9062, -3.3594, 0.1621, 1.1719, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5312, -3.5156, 0.6719, 0.9219, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:09:54,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 252.90 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.5000, -0.0693, 1.7812, -1.3359, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:2')
tensor([[-4.5000, -2.2500, 1.7734, 1.1484, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.7812, -2.4062, -0.4805, 2.8438, -1.3203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.8438, -2.9062, 0.7422, 0.5078, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:09:55,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:09:55,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.40 | bwd_microstep: 56.23 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 55.12 | step_microstep: 1.40
[2025-11-06 18:09:55,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 408.33 | bwd: 57.26 | bwd_inner: 1.99 | bwd_allreduce: 55.15 | step: 1.49
29%|██▉ | 1028/3507 [25:08<37:37, 1.10it/s] {'loss': 0.9263, 'learning_rate': 1.6587234384631718e-05, 'epoch': 0.29}
tensor([[-4.0000, -3.0781, -0.2793, 2.1250, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.9531, -0.5859, 1.8906, -0.9453, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.0938, -3.1406, -1.5312, 2.7031, -1.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:55,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.65 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.0781, -2.9062, -1.1797, 2.3438, -1.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-3.7656, -1.6797, 1.3594, 0.3691, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-4.2188, -3.3594, -0.6758, 1.8984, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.1562, -4.1875, -2.5156, 1.7422, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.3281, -1.1016, 1.5703, 2.7656, -1.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:09:56,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.16 | optimizer_step: 0.26
[2025-11-06 18:09:56,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.74 | bwd_microstep: 334.80 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 333.41 | step_microstep: 1.79
[2025-11-06 18:09:56,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.40 | bwd: 335.83 | bwd_inner: 2.19 | bwd_allreduce: 333.46 | step: 1.87
29%|██▉ | 1029/3507 [25:10<45:02, 1.09s/it] {'loss': 1.3338, 'learning_rate': 1.658028159674914e-05, 'epoch': 0.29}
tensor([[-2.9062e+00, -4.0625e-01, 2.4219e+00, -1.6174e-03, -2.4531e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.4375, -1.3281, 1.3516, 3.1562, -1.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:56,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.16 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-5.6250, -3.7500, -0.1025, -0.1416, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.8125, -4.1562, 0.1035, 1.5312, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.6875, -3.9688, -1.0703, 2.0312, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6562, -3.3906, -1.4688, 1.9141, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7188, -3.6094, -1.7266, 2.1406, -1.9922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1875, -3.0781, 1.0469, 0.8594, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:09:57,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.17 | optimizer_step: 0.21
[2025-11-06 18:09:57,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.40 | bwd_microstep: 567.23 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 565.83 | step_microstep: 1.92
[2025-11-06 18:09:57,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 454.59 | bwd: 568.14 | bwd_inner: 2.07 | bwd_allreduce: 565.88 | step: 2.02
29%|██▉ | 1030/3507 [25:11<44:43, 1.08s/it] {'loss': 0.4584, 'learning_rate': 1.657332319411002e-05, 'epoch': 0.29}
tensor([[-1.6094, 0.9727, 2.5469, -1.6562, -1.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-6.3438, -5.5000, -2.2969, 0.5156, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:09:57,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.11 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.8281, -0.6719, 2.3281, 0.5312, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4062, -3.7812, -0.8516, 3.0625, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8125, -3.5625, -0.0559, 2.0156, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.4844, -3.4375, -1.7578, 1.9766, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6250, -3.5156, -0.3789, 1.6797, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0000, -3.0625, -0.2471, 2.0781, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:10:00,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:10:00,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.66 | bwd_microstep: 1955.54 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1954.40 | step_microstep: 1.90
[2025-11-06 18:10:00,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.81 | bwd: 1956.57 | bwd_inner: 1.98 | bwd_allreduce: 1954.43 | step: 1.98
29%|██▉ | 1031/3507 [25:13<1:00:21, 1.46s/it] {'loss': 0.8048, 'learning_rate': 1.6566359182651758e-05, 'epoch': 0.29}
tensor([[-2.2969, 0.0359, 1.9766, -1.6094, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.3281, -2.2656, 0.1621, 1.6719, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.6562, -3.6875, 0.4102, 0.7461, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.0000, -5.0938, -1.0938, -0.8438, -5.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:10:00,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.14 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.9062, -3.7344, -0.2539, 1.8828, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.3750, -1.0859, 2.0156, -0.1348, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.8906, -1.2266, 2.4062, -0.5078, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.9688, -4.5625, -1.0469, 0.1934, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:10:00,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:10:00,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.92 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.77 | step_microstep: 1.97
[2025-11-06 18:10:00,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.09 | bwd: 2.59 | bwd_inner: 1.67 | bwd_allreduce: 0.79 | step: 2.04
29%|██▉ | 1032/3507 [25:14<47:25, 1.15s/it] {'loss': 0.4604, 'learning_rate': 1.6559389568316525e-05, 'epoch': 0.29}
tensor([[-4.5625, -4.0625, -1.4297, 1.8516, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:10:00,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.67 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.4531, -0.1465, 2.4062, 0.1289, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.2188, -2.9531, 1.1406, 0.3828, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.3750, -2.9219, -0.7188, 2.3125, -1.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.4062, 0.2578, 2.3750, -1.5312, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-6.1875, -4.2812, 0.0483, 0.8438, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6562, -2.2969, 1.4375, -0.3887, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.7500, -3.4375, 0.2305, 2.3125, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:10:01,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:10:01,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.66 | bwd_microstep: 564.83 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 563.88 | step_microstep: 1.80
[2025-11-06 18:10:01,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.36 | bwd: 565.86 | bwd_inner: 1.82 | bwd_allreduce: 563.91 | step: 1.88
29%|██▉ | 1033/3507 [25:15<45:37, 1.11s/it] {'loss': 0.5562, 'learning_rate': 1.655241435705129e-05, 'epoch': 0.29}
tensor([[-2.8125, -0.2266, 2.5312, -0.8125, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:10:01,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.10 | bwd_microstep: 1.10 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.4219, -1.8359, 0.8789, 0.5547, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7969, -2.5625, 0.4629, 2.2656, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3750, -1.5469, 2.0000, -1.8906, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7500, -3.2344, -0.6211, 3.0938, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9062, -3.7500, -0.5078, 1.5547, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.0000, -2.5469, 0.5117, 1.1094, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0312, -3.3750, -0.6406, 2.5625, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:10:03,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:10:03,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.83 | bwd_microstep: 651.97 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 650.81 | step_microstep: 2.00
[2025-11-06 18:10:03,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.94 | bwd: 653.07 | bwd_inner: 2.09 | bwd_allreduce: 650.84 | step: 2.08
29%|██▉ | 1034/3507 [25:16<51:49, 1.26s/it] {'loss': 0.2027, 'learning_rate': 1.6545433554807796e-05, 'epoch': 0.29}
tensor([[-3.2969, -0.4844, 2.6875, -0.9102, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2500, -3.9219, -1.6875, 1.9688, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:10:03,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.05 | bwd_microstep: 1.24 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.6562, -3.5156, -0.1426, 2.2812, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8438, -3.5781, -0.2197, 1.7812, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.3906, -1.5781, 1.3281, 0.5000, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4062, -4.3125, -2.5938, 0.9375, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5000, -1.9922, 2.1250, 0.2178, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.3594, -0.8281, 2.6250, 0.0796, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:10:03,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 18:10:03,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.10 | bwd_microstep: 29.92 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 28.73 | step_microstep: 2.02
[2025-11-06 18:10:03,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 485.18 | bwd: 31.16 | bwd_inner: 2.27 | bwd_allreduce: 28.76 | step: 2.10
30%|██▉ | 1035/3507 [25:17<43:09, 1.05s/it] {'loss': 0.2412, 'learning_rate': 1.653844716754254e-05, 'epoch': 0.3}
tensor([[-6.2500, -3.1875, 1.4688, -1.7422, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.0000, -1.1250, 2.8438, -0.0693, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:10:03,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.46 | bwd_microstep: 1.17 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
[h264 @ 0x9815180] mmco: unref short failure
[h264 @ 0x9815180] mmco: unref short failure
tensor([[-4.2188, -2.2500, 1.0312, 0.2969, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.2812, -3.2188, 0.0264, -1.2656, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.5312, -0.3340, 2.7031, 1.0391, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-8.1250, -7.0000, -3.3281, -0.9297, -5.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4688, -3.2812, 0.1387, 2.4844, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3125, -1.0938, 2.3594, -2.2969, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:10:05,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.20 | optimizer_step: 0.19
[2025-11-06 18:10:05,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.26 | bwd_microstep: 2.09 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.91 | step_microstep: 204.84
[2025-11-06 18:10:05,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 282.74 | bwd: 3.26 | bwd_inner: 2.12 | bwd_allreduce: 0.96 | step: 204.94
30%|██▉ | 1036/3507 [25:19<53:18, 1.29s/it] {'loss': 0.3714, 'learning_rate': 1.6531455201216803e-05, 'epoch': 0.3}
tensor([[-4.0938, -2.5156, 0.8242, 1.7969, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:10:05,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.12 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.9688, -2.3594, 1.8594, -0.0728, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.4688, -4.0938, -1.3359, 2.9219, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.4062, -1.4766, 1.5938, 0.6016, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4375, -1.7969, 1.0078, 1.2812, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.7500, -4.9062, -1.6250, 1.4375, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.1875, -2.7031, 0.5234, 1.2266, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.0625, -0.9414, 1.8750, -0.0835, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:10:07,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.27 | optimizer_step: 0.33
[2025-11-06 18:10:07,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.78 | bwd_microstep: 1187.84 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1186.61 | step_microstep: 2.67
[2025-11-06 18:10:07,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.91 | bwd: 1188.73 | bwd_inner: 1.92 | bwd_allreduce: 1186.67 | step: 2.76
30%|██▉ | 1037/3507 [25:20<56:55, 1.38s/it] {'loss': 0.5173, 'learning_rate': 1.6524457661796626e-05, 'epoch': 0.3}
tensor([[-3.5469, -1.2500, 1.3750, -1.2188, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:10:07,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 148.62 | bwd_microstep: 1.13 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-4.1875, -3.3750, -0.5703, 2.1562, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.1094, -2.9844, -1.2109, 2.8906, -1.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.8750, -3.1875, 0.6836, 1.6016, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3438, -3.4688, 0.4375, 0.7031, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.6562, -1.6328, 1.5000, 0.5781, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8594, -0.7891, 3.3125, -0.5391, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0625, -4.7188, -2.1562, 1.5625, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:10:08,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:10:08,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.74 | bwd_microstep: 1071.46 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1070.26 | step_microstep: 1.69
[2025-11-06 18:10:08,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.37 | bwd: 1072.58 | bwd_inner: 2.12 | bwd_allreduce: 1070.32 | step: 1.79
30%|██▉ | 1038/3507 [25:22<57:36, 1.40s/it] {'loss': 0.2397, 'learning_rate': 1.6517454555252787e-05, 'epoch': 0.3}
tensor([[-4.3438, -2.7031, 0.5977, 1.0625, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4844, -0.6406, 3.0000, -0.4824, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:10:08,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.24 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.1562, -3.3906, -0.5508, 2.0156, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.6719, 0.1562, 2.6875, -1.6406, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.5000, -0.7422, 1.7109, 0.5586, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.3125, -1.7969, 2.1562, 0.0693, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.3438, -2.8438, 1.6172, 0.2256, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6250, -3.0781, 0.5703, 1.8203, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:10:09,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:10:09,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.92 | bwd_microstep: 377.71 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 376.46 | step_microstep: 1.59
[2025-11-06 18:10:09,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.19 | bwd: 378.58 | bwd_inner: 1.91 | bwd_allreduce: 376.50 | step: 1.67
30%|██▉ | 1039/3507 [25:23<49:59, 1.22s/it] {'loss': 0.2787, 'learning_rate': 1.6510445887560838e-05, 'epoch': 0.3}
tensor([[-4.1562, -1.4219, 1.7266, -1.8984, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.1406, -2.3594, 0.1050, 2.5938, -1.7734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.1875, -2.8906, -0.7578, 3.0156, -1.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:10:09,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.73 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.19
tensor([[-6.0312, -4.8125, -1.1953, 0.8594, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.7812, -5.3750, -1.9297, -1.0391, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3438, -2.1875, 1.6562, 1.0859, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.7188, -3.5469, -0.1621, 2.0000, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.0625, -3.9062, -1.9375, 1.5469, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:10:11,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:10:11,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.33 | bwd_microstep: 1885.80 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 1884.49 | step_microstep: 1.98
[2025-11-06 18:10:11,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.08 | bwd: 1886.88 | bwd_inner: 2.21 | bwd_allreduce: 1884.54 | step: 2.17
30%|██▉ | 1040/3507 [25:25<1:03:05, 1.53s/it] {'loss': 0.5624, 'learning_rate': 1.6503431664701052e-05, 'epoch': 0.3}
tensor([[-3.9062, -3.8281, -1.9297, 1.9609, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.4062, -1.8828, -0.2119, 2.1875, -1.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:10:11,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.88 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.1562, -2.6406, 0.5898, 1.5938, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2812, -3.4375, -0.5234, 2.1094, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4062, -2.3125, 1.5859, 1.2266, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5312, -3.5938, 0.6836, 1.1016, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1875, -3.3750, 0.3848, 0.5781, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.9062, -3.2188, 0.6328, 1.3672, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:10:12,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:10:12,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.17 | bwd_microstep: 416.47 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 415.35 | step_microstep: 1.79
[2025-11-06 18:10:12,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.07 | bwd: 417.26 | bwd_inner: 1.76 | bwd_allreduce: 415.38 | step: 1.87
30%|██▉ | 1041/3507 [25:26<54:01, 1.31s/it] {'loss': 0.405, 'learning_rate': 1.6496411892658465e-05, 'epoch': 0.3}
tensor([[-6.5000, -4.8438, -1.0391, -0.1992, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3125, -3.8906, -1.3281, 2.3750, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7500, -3.8125, -0.4414, 2.7344, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:10:12,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.06 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.0625, -2.3594, 1.8125, -0.7148, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2656, -2.9531, -0.7930, 2.7812, -1.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.1250, -2.6250, -0.1836, 3.1719, -1.5859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3438, -1.4219, 2.4375, -1.2422, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9219, -2.1406, 1.0625, 1.0469, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:10:14,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:10:14,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.80 | bwd_microstep: 1896.86 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1895.68 | step_microstep: 1.88
[2025-11-06 18:10:14,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.88 | bwd: 1897.92 | bwd_inner: 2.07 | bwd_allreduce: 1895.72 | step: 1.96
30%|██▉ | 1042/3507 [25:28<1:06:20, 1.61s/it] {'loss': 0.679, 'learning_rate': 1.648938657742283e-05, 'epoch': 0.3}
tensor([[-5.0000, -2.4531, 1.4609, -0.6523, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:10:14,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.63 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.0625, -2.8594, 0.2080, 1.7734, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.2188, -2.6719, 0.7188, 1.8984, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6250, -4.9688, -1.8203, 1.6641, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.8125, -3.9219, 0.2930, 0.8438, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.3281, -1.4219, 1.8281, 1.6172, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3750, -3.0938, 0.2402, 1.8516, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4062, -3.6719, -0.8672, 1.8438, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:10:15,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:10:15,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.03 | bwd_microstep: 257.11 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 255.78 | step_microstep: 1.47
[2025-11-06 18:10:15,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.62 | bwd: 258.01 | bwd_inner: 2.07 | bwd_allreduce: 255.81 | step: 1.55
30%|██▉ | 1043/3507 [25:29<53:42, 1.31s/it] {'loss': 0.5183, 'learning_rate': 1.6482355724988646e-05, 'epoch': 0.3}
tensor([[-4.1250, -2.2969, 0.9961, 0.7188, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7188, -3.5781, -1.5703, 2.3125, -1.9766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:10:15,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.00 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.2344, -2.9688, -1.1719, 2.2344, -1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0000, -3.3594, 0.5547, 1.6562, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4688, -3.3594, -0.2832, 1.7109, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1562, -4.9688, -2.9219, 0.5977, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4375, -3.5938, -0.4668, 2.5781, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5938, -3.7969, -0.9922, 1.6016, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:10:16,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 18:10:16,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.10 | bwd_microstep: 523.26 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 522.21 | step_microstep: 1.56
[2025-11-06 18:10:16,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.14 | bwd: 524.15 | bwd_inner: 1.77 | bwd_allreduce: 522.25 | step: 1.64
30%|██▉ | 1044/3507 [25:30<49:01, 1.19s/it] {'loss': 0.4921, 'learning_rate': 1.647531934135512e-05, 'epoch': 0.3}
tensor([[-3.0469, -2.5312, -0.4980, 2.4062, -1.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:10:16,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.26 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.4062, -5.0312, -2.4688, 1.0625, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.4922, 0.6484, 2.2344, -1.2969, -1.6328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5938, -3.4375, 0.2773, -0.6250, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-0.4590, 2.0781, 4.3125, 0.6875, -0.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.2188, -1.9062, 1.6328, -0.0942,
-3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4375, -2.5781, 2.0625, -0.3887, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7656, -2.6406, -0.8477, 3.0938, -1.2266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:10:17,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.26 [2025-11-06 18:10:17,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.10 | bwd_microstep: 1016.06 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1014.86 | step_microstep: 1.98 [2025-11-06 18:10:17,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.39 | bwd: 1017.10 | bwd_inner: 2.05 | bwd_allreduce: 1014.92 | step: 2.06 30%|██▉ | 1045/3507 [25:31<51:29, 1.25s/it] {'loss': 0.1505, 'learning_rate': 1.646827743252619e-05, 'epoch': 0.3} 30%|██▉ | 1045/3507 [25:31<51:29, 1.25s/it]tensor([[-4.3125, -3.0156, -0.1914, 0.4355, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5312, -1.8906, 1.2969, 1.6562, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0000, -3.5781, 0.2559, 1.9141, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:10:17,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 299.25 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.4219, -1.4531, 1.6172, 0.5547, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7500, -0.8750, 2.9531, -0.6484, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2500, -2.4688, 0.8242, 0.6875, -3.1406]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.8281, 0.5469, 2.3906, -1.0234, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1250, -2.0156, 1.2188, -0.1079, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:10:19,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:10:19,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.99 | bwd_microstep: 1610.05 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1608.88 | step_microstep: 1.73 [2025-11-06 18:10:19,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 480.26 | bwd: 1610.97 | bwd_inner: 1.92 | bwd_allreduce: 1608.92 | step: 1.81 30%|██▉ | 1046/3507 [25:33<1:02:16, 1.52s/it] {'loss': 0.3436, 'learning_rate': 1.6461230004510508e-05, 'epoch': 0.3} 30%|██▉ | 1046/3507 [25:33<1:02:16, 1.52s/it]tensor([[-3.6562, -3.0000, -0.2734, 2.9375, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4062, -3.1562, 0.0060, 1.6406, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8906, -2.7031, 0.1992, 1.6875, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:10:20,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.12 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.6562, -3.6719, -0.2891, 2.4844, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1562, -3.8750, -0.0693, 2.0938, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6250, -2.8594, 1.0859, 1.5391, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:3') tensor([[-4.0000, -1.7969, 1.8047, 0.5547, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2500, -2.8594, 1.6250, 0.5078, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:10:20,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:10:20,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 87.84 | bwd_microstep: 131.65 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 130.68 | step_microstep: 1.62 [2025-11-06 18:10:20,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 274.98 | bwd: 132.51 | bwd_inner: 1.67 | bwd_allreduce: 130.71 | step: 1.70 30%|██▉ | 1047/3507 [25:34<48:56, 1.19s/it] {'loss': 0.2171, 'learning_rate': 1.6454177063321425e-05, 'epoch': 0.3} 30%|██▉ | 1047/3507 [25:34<48:56, 1.19s/it]tensor([[-4.7188, -2.9531, 0.9492, 1.7031, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5625, -1.7656, 1.5547, 1.5859, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:10:20,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.24 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.17 tensor([[-3.3906, -2.0781, 0.7383, 1.8125, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0938, -4.2188, -1.3203, 0.9219, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3750, -3.8281, -0.1992, 0.6211, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.6406, -0.2988, 2.3594, -0.2852, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0625, 
-1.9531, 0.3828, 1.2422, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.3750, -3.5938, -0.6797, 1.9688, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:10:20,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:10:20,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.19 | bwd_microstep: 183.88 | bwd_inner_microstep: 1.55 | bwd_allreduce_microstep: 182.27 | step_microstep: 1.95 [2025-11-06 18:10:20,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.46 | bwd: 184.91 | bwd_inner: 2.48 | bwd_allreduce: 182.30 | step: 2.13 30%|██▉ | 1048/3507 [25:34<41:25, 1.01s/it] {'loss': 0.559, 'learning_rate': 1.6447118614977012e-05, 'epoch': 0.3} 30%|██▉ | 1048/3507 [25:34<41:25, 1.01s/it]tensor([[-4.0000, -2.5781, 0.8789, 2.4219, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2500, -2.5781, 1.6406, -0.5820, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4375, -1.7188, 2.6406, 0.4004, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:10:21,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.56 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.3750, -3.1562, 0.1924, 2.1562, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.6250, -3.7812, -0.0457, -0.1226, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1875, -4.3438, 0.0078, 0.7422, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8125, -0.6602, 1.1641, -1.4062, -2.5000]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.6562, -4.1250, -0.4199, 0.6172, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:10:23,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:10:23,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.66 | bwd_microstep: 2196.51 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 2195.27 | step_microstep: 1.91 [2025-11-06 18:10:23,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 449.25 | bwd: 2197.51 | bwd_inner: 2.08 | bwd_allreduce: 2195.31 | step: 1.99 30%|██▉ | 1049/3507 [25:37<1:01:59, 1.51s/it] {'loss': 0.5167, 'learning_rate': 1.6440054665500024e-05, 'epoch': 0.3} 30%|██▉ | 1049/3507 [25:37<1:01:59, 1.51s/it]tensor([[-3.6094, -3.3594, -1.0391, 2.9844, -1.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:10:23,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.23 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.3750, -4.3438, 0.0850, 0.3613, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8750, -1.2422, 1.7969, -1.4688, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9375, -2.1406, 1.3281, 1.5000, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -2.1406, 1.8125, -0.2520, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6250, -3.2969, -0.1846, 0.8789, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6562, -2.0938, 1.8672, -0.4219, -3.8125]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.9375, 0.5195, 2.4844, -1.0859, -1.9453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:10:23,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:10:23,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.40 | bwd_microstep: 55.78 | bwd_inner_microstep: 1.55 | bwd_allreduce_microstep: 54.17 | step_microstep: 1.29 [2025-11-06 18:10:23,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 312.66 | bwd: 56.83 | bwd_inner: 2.53 | bwd_allreduce: 54.20 | step: 1.37 30%|██▉ | 1050/3507 [25:37<48:16, 1.18s/it] {'loss': 0.8623, 'learning_rate': 1.643298522091792e-05, 'epoch': 0.3} 30%|██▉ | 1050/3507 [25:37<48:16, 1.18s/it]tensor([[-3.5781, -0.8320, 2.9219, -0.1865, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5938, -2.9219, -0.3867, 2.4375, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:10:24,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.32 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.3125, -0.3320, 2.4531, 1.0078, -1.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9375, -3.4062, -0.8438, 2.5938, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.1406, -1.0469, 1.8984, 0.3027, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.7109, 0.4297, 2.6875, 0.4707, -1.4922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0625, -4.0625, 0.5508, 0.9648, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:2') tensor([[-4.8750, -2.8906, 1.1328, 1.0156, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:10:25,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:10:25,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.84 | bwd_microstep: 979.09 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 978.05 | step_microstep: 2.41 [2025-11-06 18:10:25,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.18 | bwd: 980.07 | bwd_inner: 1.85 | bwd_allreduce: 978.09 | step: 2.49 30%|██▉ | 1051/3507 [25:39<50:12, 1.23s/it] {'loss': 0.4425, 'learning_rate': 1.642591028726285e-05, 'epoch': 0.3} 30%|██▉ | 1051/3507 [25:39<50:12, 1.23s/it]tensor([[-4.2500, -3.4844, -0.6484, 2.2656, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:10:25,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.47 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.9375, -2.6250, 1.7266, 0.8086, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3281, -2.2500, 0.3242, 1.9219, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1719, -0.4082, 2.7344, -0.8438, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5781, -0.6406, 2.9219, -0.7109, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1250, -1.5312, 1.8750, -1.0391, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5156, -3.1875, -1.1562, 1.7656, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.0469, -1.8516, 
-0.3086, 3.1875, -0.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') [2025-11-06 18:10:26,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:10:26,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.57 | bwd_microstep: 741.35 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 740.20 | step_microstep: 1.64 [2025-11-06 18:10:26,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 272.07 | bwd: 742.17 | bwd_inner: 1.79 | bwd_allreduce: 740.24 | step: 1.71 30%|██▉ | 1052/3507 [25:40<47:57, 1.17s/it] {'loss': 1.509, 'learning_rate': 1.6418829870571632e-05, 'epoch': 0.3} 30%|██▉ | 1052/3507 [25:40<47:57, 1.17s/it]tensor([[-1.9297, -0.1787, 2.2969, 1.2188, -1.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -3.2500, 0.3164, 0.7812, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:10:26,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.60 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.1094, -0.4414, 3.0625, -0.1191, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7344, -1.4453, 2.2656, 0.4883, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6875, -3.4844, -1.2891, 2.5625, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -1.7891, 2.2188, -1.0156, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9688, -3.9375, 0.5117, 0.5586, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9062, -1.4062, 1.1172, 0.9492, -2.0938]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:10:28,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.20 | optimizer_step: 0.29 [2025-11-06 18:10:28,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.97 | bwd_microstep: 1704.62 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1703.45 | step_microstep: 2.39 [2025-11-06 18:10:28,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.60 | bwd: 1705.54 | bwd_inner: 1.80 | bwd_allreduce: 1703.58 | step: 2.46 30%|███ | 1053/3507 [25:42<59:27, 1.45s/it] {'loss': 0.3521, 'learning_rate': 1.641174397688578e-05, 'epoch': 0.3} 30%|███ | 1053/3507 [25:42<59:27, 1.45s/it]tensor([[-4.9375, -3.7969, -0.4492, 1.8125, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:10:28,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.66 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.7031, -2.3125, 0.4824, 1.3203, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0312, -0.8164, 2.5938, 1.4297, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9062, -2.5781, -0.6523, 2.2500, -1.5391]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8906, -0.0620, 3.0625, -0.9688, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.6875, -3.8594, 0.0967, 0.1738, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0938, -2.1719, 1.2969, 0.4785, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1406, -0.2500, 2.9375, -1.5000, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:3') [2025-11-06 18:10:28,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.73 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:10:28,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.54 | bwd_microstep: 120.51 | bwd_inner_microstep: 1.36 | bwd_allreduce_microstep: 119.06 | step_microstep: 2.14 [2025-11-06 18:10:28,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 293.22 | bwd: 121.59 | bwd_inner: 2.36 | bwd_allreduce: 119.11 | step: 2.22 30%|███ | 1054/3507 [25:42<47:03, 1.15s/it] {'loss': 0.3757, 'learning_rate': 1.640465261225147e-05, 'epoch': 0.3} 30%|███ | 1054/3507 [25:42<47:03, 1.15s/it]tensor([[-1.1797, 0.9766, 2.2656, -1.1016, -1.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:10:28,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.50 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.9219, -2.7188, -0.5898, 3.3281, -1.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -3.8125, -0.4473, 1.6875, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7500, -0.2139, 2.2031, -1.3125, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.3125, -1.0938, 1.3828, 2.2188, -1.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6094, -0.6875, 2.9219, -1.1875, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4688, 0.2891, 2.7969, -1.3125, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6250, -2.6719, 1.9531, -1.2188, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 
18:10:30,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.05 | optimizer_gradients: 0.19 | optimizer_step: 0.22 [2025-11-06 18:10:30,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.26 | bwd_microstep: 1463.83 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1462.64 | step_microstep: 2.94 [2025-11-06 18:10:30,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 239.77 | bwd: 1464.84 | bwd_inner: 2.03 | bwd_allreduce: 1462.68 | step: 3.03 30%|███ | 1055/3507 [25:44<54:11, 1.33s/it] {'loss': 0.3119, 'learning_rate': 1.6397555782719556e-05, 'epoch': 0.3} 30%|███ | 1055/3507 [25:44<54:11, 1.33s/it]tensor([[-3.7500, -0.8281, 3.3438, 0.4805, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -3.3125, 0.3691, 0.2109, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:10:30,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.73 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9688, -3.6875, -0.6094, 0.3027, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0625, 0.8945, 3.2969, -1.5234, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.1719, -0.2832, 2.2031, -2.2969, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-3.6250, -1.7578, 1.3516, 0.5586, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3438, -2.9062, 0.4336, 1.5000, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6719, -0.7852, 2.3594, 1.7578, -1.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:10:31,576] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:10:31,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.96 | bwd_microstep: 645.73 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 644.51 | step_microstep: 2.42 [2025-11-06 18:10:31,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.72 | bwd: 646.77 | bwd_inner: 2.08 | bwd_allreduce: 644.56 | step: 2.50 30%|███ | 1056/3507 [25:45<50:09, 1.23s/it] {'loss': 1.0354, 'learning_rate': 1.639045349434554e-05, 'epoch': 0.3} 30%|███ | 1056/3507 [25:45<50:09, 1.23s/it]tensor([[-5.4688, -3.5469, 0.2988, -0.0591, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:10:31,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.52 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.1406, -1.8438, -0.4219, 2.2969, -0.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6562, -4.3750, -1.8906, 1.8750, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3750, -1.8750, 2.1875, 0.2969, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.8438, -1.8203, 2.5625, -1.1094, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4844, -2.2969, -0.2031, 3.7188, -0.9570]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0312, -1.5469, 2.1094, -0.1748, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0312, -4.7812, -0.9492, 1.1172, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:10:33,586] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:10:33,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.25 | bwd_microstep: 1661.04 | bwd_inner_microstep: 1.35 | bwd_allreduce_microstep: 1659.60 | step_microstep: 1.52 [2025-11-06 18:10:33,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 310.79 | bwd: 1662.01 | bwd_inner: 2.25 | bwd_allreduce: 1659.63 | step: 1.60 30%|███ | 1057/3507 [25:47<59:43, 1.46s/it] {'loss': 0.1494, 'learning_rate': 1.63833457531896e-05, 'epoch': 0.3} 30%|███ | 1057/3507 [25:47<59:43, 1.46s/it]tensor([[-0.2734, 0.3047, 2.2188, 5.0938, 0.6367]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8125, -3.7344, -0.2129, -1.4141, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:10:33,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.77 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-8.3750, -6.9688, -2.6406, -0.6484, -6.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5000, -1.9219, 2.2031, -0.2393, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0312, -3.8438, -0.4980, 1.3125, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -3.8594, -0.8750, 1.2891, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0938, -3.2188, -0.2207, 2.4531, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2500, -3.7969, -1.1875, 2.1250, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:10:34,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | 
optimizer_gradients: 0.21 | optimizer_step: 0.19
[Rank 0] training progress, steps 1058-1079 of 3507, 2025-11-06 18:10:34 - 18:11:06 (per-step times in ms; per-rank logits/labels debug prints omitted):

step  loss    learning_rate           epoch  s/it  fwd     bwd      bwd_allreduce  step
1058  0.355   1.6376232565316557e-05  0.3    1.32  352.42   612.74   610.63        2.14
1059  0.1766  1.6369113936795876e-05  0.3    1.56  445.16  1635.46  1633.48        2.19
1060  0.3535  1.6361989873701668e-05  0.3    1.32  364.81   337.33   335.33        1.81
1061  0.4244  1.6354860382112692e-05  0.3    1.17  381.32   427.53   425.42        2.38
1062  0.6859  1.6347725468112316e-05  0.3    1.43  375.65  1622.27  1619.93        2.49
1063  0.5142  1.6340585137788557e-05  0.3    1.28  379.21   500.51   498.41        1.97
1064  0.9777  1.633343939723404e-05   0.3    1.83  395.67  2687.11  2684.84        2.02
1065  0.331   1.6326288252546008e-05  0.3    1.57  347.96   565.54   563.36        1.74
1066  0.4213  1.6319131709826325e-05  0.3    1.22  354.08    35.87    32.98        1.65
1067  1.3164  1.6311969775181447e-05  0.3    1.65  380.84  2210.92  2209.10        2.18
1068  0.1629  1.6304802454722447e-05  0.3    1.33  323.31   232.37   230.45        1.76
1069  0.234   1.6297629754564973e-05  0.3    1.81  446.18  2433.77  2431.64        2.83
1070  0.5424  1.6290451680829283e-05  0.31   1.42  434.95    49.57    47.42        1.43
1071  0.1311  1.6283268239640203e-05  0.31   1.83  262.67  2493.33  2491.10        1.82
1072  0.6968  1.6276079437127155e-05  0.31   1.40  257.97    98.40    96.36        1.61
1073  0.1645  1.6268885279424126e-05  0.31   2.00  488.93  2891.42  2889.32        1.95
1074  0.2515  1.6261685772669675e-05  0.31   1.57  277.65   236.20   234.27        1.46
1075  0.9348  1.6254480923006924e-05  0.31   1.57  386.18  1171.24  1169.19        3.19
1076  0.3203  1.6247270736583555e-05  0.31   1.26  456.61    37.53    35.56        1.61
1077  0.1939  1.6240055219551805e-05  0.31   1.46  309.35  1569.17  1567.09        2.23
1078  0.3996  1.6232834378068454e-05  0.31   1.16  380.33    62.00    59.98        2.00
1079  0.506   1.6225608218294832e-05  0.31   1.54  467.86  1893.50  1891.11        3.34
[2025-11-06 18:11:06,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.00 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-2.5156, 0.3535, 2.6875, -1.7344, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.2031, -1.4219, 0.7188, -0.4922, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.3906, -3.0156, -0.5234, 3.0781, -1.7266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3125, -3.4219, 0.9805, 1.4766, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.8281, -1.5156, 1.8359, -0.1084, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6875, -1.2969, 1.8594, -0.6289, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5938, -2.4375, 1.6797, 1.2500, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:11:06,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.15 | optimizer_step: 0.19 [2025-11-06 18:11:06,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.96 | bwd_microstep: 88.22 | bwd_inner_microstep: 1.29 | bwd_allreduce_microstep: 86.80 | step_microstep: 2.08 [2025-11-06 18:11:06,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.98 | bwd: 89.12 | bwd_inner: 2.09 | bwd_allreduce: 86.84 | step: 2.18 31%|███ | 1080/3507 [26:20<48:59, 1.21s/it] {'loss': 0.9117, 'learning_rate': 1.62183767463968e-05, 'epoch': 0.31} 31%|███ | 1080/3507 [26:20<48:59, 1.21s/it]tensor([[-4.3125, -3.1094, 0.2490, 2.1406, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3125, -3.0156, -0.7617, 2.7344, 
-1.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6250, -2.7031, 0.7305, -0.3887, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9375, -2.6562, 1.5312, 0.4434, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:11:07,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.86 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.4375, -4.1875, -1.6484, 2.2656, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0000, -2.0312, 2.2969, -0.9102, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0000, -3.5625, -1.0156, 2.2969, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5000, -1.0938, 2.0938, 3.3125, -1.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:11:07,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.18 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:11:07,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.91 | bwd_microstep: 2.02 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 0.76 | step_microstep: 3.09 [2025-11-06 18:11:07,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 456.79 | bwd: 2.87 | bwd_inner: 1.94 | bwd_allreduce: 0.80 | step: 3.18 31%|███ | 1081/3507 [26:21<40:24, 1.00it/s] {'loss': 0.1548, 'learning_rate': 1.621113996854476e-05, 'epoch': 0.31} 31%|███ | 1081/3507 [26:21<40:24, 1.00it/s]tensor([[-4.0000, -3.0938, -0.3477, 1.6562, -2.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:11:07,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
187.47 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.7656, -2.2188, 0.9141, 1.1562, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.0469, 0.6016, 3.8750, 0.7109, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9531, -3.0156, -0.0415, 2.1875, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4375, -3.9062, -0.9805, 2.5781, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2188, 0.7344, 3.0469, -1.7734, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.4062, -2.7344, 0.6445, 0.8398, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.7500, -4.2812, 0.6797, -0.2910, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:11:10,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.18 | optimizer_step: 0.30 [2025-11-06 18:11:10,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 67.69 | bwd_microstep: 2379.69 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 2378.43 | step_microstep: 2.21 [2025-11-06 18:11:10,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 255.18 | bwd: 2380.69 | bwd_inner: 2.05 | bwd_allreduce: 2378.49 | step: 2.32 31%|███ | 1082/3507 [26:23<1:00:37, 1.50s/it] {'loss': 0.7232, 'learning_rate': 1.620389789091364e-05, 'epoch': 0.31} 31%|███ | 1082/3507 [26:23<1:00:37, 1.50s/it]tensor([[-4.8125, -3.8750, -0.5859, 1.6875, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9922, 0.6953, 3.5156, 0.4199, -1.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') 
[2025-11-06 18:11:10,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.60 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.4531, -0.7656, 2.0625, -1.4688, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -2.0156, 1.8359, -0.6445, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1562, -2.0156, 1.2812, -0.3633, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)tensor([[-1.6719, 0.9023, 2.8906, -1.2266, -1.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([1], device='cuda:2') tensor([[-4.3125, -3.9688, -1.4141, 2.1406, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1719, -0.5898, 2.7969, -0.0664, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:11:10,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:11:10,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.13 | bwd_microstep: 1.78 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.69 | step_microstep: 1.54 [2025-11-06 18:11:10,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.76 | bwd: 2.72 | bwd_inner: 1.83 | bwd_allreduce: 0.74 | step: 1.64 31%|███ | 1083/3507 [26:24<47:20, 1.17s/it] {'loss': 0.3636, 'learning_rate': 1.619665051968288e-05, 'epoch': 0.31} 31%|███ | 1083/3507 [26:24<47:20, 1.17s/it]tensor([[-2.5000, -0.4160, 2.1094, -0.0713, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:11:10,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.19 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 
| step_microstep: 0.07 tensor([[-4.7500, -2.4062, 1.5859, 0.1387, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4375, -3.1719, 1.2578, 0.4512, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9688, -3.2656, -0.6289, 1.8047, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.3125, -0.6914, 2.2031, -1.2812, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4062, -2.9219, 0.5625, 1.5234, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1875, -4.1250, -0.2500, -0.8672, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8594, 0.1455, 3.0000, -1.7969, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:11:12,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:11:12,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.59 | bwd_microstep: 1877.67 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 1876.66 | step_microstep: 1.85 [2025-11-06 18:11:12,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.81 | bwd: 1878.59 | bwd_inner: 1.76 | bwd_allreduce: 1876.70 | step: 1.92 31%|███ | 1084/3507 [26:26<1:00:49, 1.51s/it] {'loss': 0.549, 'learning_rate': 1.6189397861036448e-05, 'epoch': 0.31} 31%|███ | 1084/3507 [26:26<1:00:49, 1.51s/it]tensor([[-4.7188, -2.4062, 1.0391, -1.0859, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1250, -3.0156, 0.6055, 3.1719, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:11:12,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 123.37 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.6250, -3.8906, -0.6445, 2.6094, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9531, -1.6719, 1.9297, 0.4023, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.3750, -3.8438, -1.1641, 1.8828, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7969, -1.1016, 2.1406, -1.4062, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9688, -1.7500, 1.2891, -0.8633, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0312, -1.9297, 1.8359, -2.6562, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:11:13,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.25 | optimizer_step: 0.20 [2025-11-06 18:11:13,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.95 | bwd_microstep: 215.00 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 213.88 | step_microstep: 2.16 [2025-11-06 18:11:13,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.34 | bwd: 215.91 | bwd_inner: 1.85 | bwd_allreduce: 213.92 | step: 2.23 31%|███ | 1085/3507 [26:27<49:11, 1.22s/it] {'loss': 0.2949, 'learning_rate': 1.6182139921162817e-05, 'epoch': 0.31} 31%|███ | 1085/3507 [26:27<49:11, 1.22s/it]tensor([[-3.1719, -0.3086, 2.5781, -1.8828, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -3.8906, -0.4160, 1.1484, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4375, -4.0938, -0.3145, 1.4531, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:0') [2025-11-06 18:11:13,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.72 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.4219, -0.8516, 2.1875, -0.6172, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7344, -2.0312, 1.0938, 1.3125, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4688, -2.3594, 2.0938, -1.7422, -4.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0938, -1.8281, 1.4453, -0.3652, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2500, -3.4531, 0.6562, 1.3203, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:11:20,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.20 | optimizer_step: 0.31 [2025-11-06 18:11:20,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.04 | bwd_microstep: 6622.24 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 6621.25 | step_microstep: 2.48 [2025-11-06 18:11:20,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.78 | bwd: 6623.09 | bwd_inner: 1.66 | bwd_allreduce: 6621.30 | step: 2.55 31%|███ | 1086/3507 [26:34<1:59:30, 2.96s/it] {'loss': 0.4517, 'learning_rate': 1.617487670625497e-05, 'epoch': 0.31} 31%|███ | 1086/3507 [26:34<1:59:30, 2.96s/it]tensor([[-3.0625, -2.1094, 0.6367, 2.7969, -1.7266]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5312, -2.8438, 0.5977, 1.0469, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:11:20,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.30 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 
| bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.6562, -3.5469, -1.3359, 2.6719, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0625, -2.9219, -1.2578, 1.8125, -1.5859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.0938, -2.8281, -0.5508, 3.1562, -1.4609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3125, -4.4062, -1.1641, 1.2734, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2812, -3.0000, -1.1875, 1.7188, -1.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-1.6797, 0.8906, 2.7500, -1.4297, -1.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 18:11:21,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:11:21,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.50 | bwd_microstep: 292.49 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 291.49 | step_microstep: 2.02 [2025-11-06 18:11:21,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.82 | bwd: 293.34 | bwd_inner: 1.70 | bwd_allreduce: 291.52 | step: 2.09 31%|███ | 1087/3507 [26:34<1:31:38, 2.27s/it] {'loss': 0.8119, 'learning_rate': 1.6167608222510395e-05, 'epoch': 0.31} 31%|███ | 1087/3507 [26:34<1:31:38, 2.27s/it]tensor([[-4.3750, -2.8750, 0.2119, 0.5977, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:11:21,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.59 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.0625, -2.2969, 1.0156, 0.6367, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-3.0156, -2.6094, -0.2422, 3.1875, -1.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5312, -5.0938, -2.3125, 1.0234, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0000, -2.1562, 1.2891, 1.4922, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1562, -3.2344, -0.0698, 2.2812, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9062, -2.1719, 1.3906, -1.7656, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, -2.8125, 1.4062, 0.2344, -3.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:11:21,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:11:21,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.46 | bwd_microstep: 159.33 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 158.18 | step_microstep: 1.73 [2025-11-06 18:11:21,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.07 | bwd: 160.27 | bwd_inner: 1.94 | bwd_allreduce: 158.21 | step: 1.81 31%|███ | 1088/3507 [26:35<1:10:21, 1.75s/it] {'loss': 0.2864, 'learning_rate': 1.616033447613106e-05, 'epoch': 0.31} 31%|███ | 1088/3507 [26:35<1:10:21, 1.75s/it]tensor([[-4.0312, -3.2812, -0.3203, 2.7188, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:11:21,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.72 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-1.9688, -0.0918, 2.2344, 1.1719, -1.4922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1719, 
-3.1406, -1.5469, 2.2031, -1.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0938, -2.6875, 1.4453, -0.2158, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5469, 0.2461, 2.5469, -1.5391, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.7656, -0.9180, 2.7812, -0.5195, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5625, -0.1006, 2.8281, -0.1660, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.7344, -3.0156, -0.1797, 2.6562, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:11:21,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:11:21,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.47 | bwd_microstep: 27.61 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 26.37 | step_microstep: 1.56 [2025-11-06 18:11:21,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.22 | bwd: 28.60 | bwd_inner: 2.07 | bwd_allreduce: 26.41 | step: 1.64 31%|███ | 1089/3507 [26:35<54:20, 1.35s/it] {'loss': 0.5526, 'learning_rate': 1.6153055473323447e-05, 'epoch': 0.31} 31%|███ | 1089/3507 [26:35<54:20, 1.35s/it]tensor([[-3.7500, -3.3594, -1.0469, 2.1250, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:11:22,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.21 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.0312, -2.9844, 1.2266, 1.0938, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2812, -3.6094, 0.3105, 1.2969, -3.6875]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0312, -3.2812, 0.1826, 0.1387, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -0.8086, 3.0312, -1.3828, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5625, -1.9062, 0.8008, 3.8750, -1.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4688, -2.5156, 0.9180, 0.4355, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -4.3125, -0.4453, 2.4844, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:11:24,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:11:24,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.13 | bwd_microstep: 2345.19 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 2343.81 | step_microstep: 1.85 [2025-11-06 18:11:24,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 319.36 | bwd: 2346.10 | bwd_inner: 2.13 | bwd_allreduce: 2343.85 | step: 1.92 31%|███ | 1090/3507 [26:38<1:10:37, 1.75s/it] {'loss': 0.364, 'learning_rate': 1.6145771220298502e-05, 'epoch': 0.31} 31%|███ | 1090/3507 [26:38<1:10:37, 1.75s/it]tensor([[-3.1719, 0.1021, 3.1719, -2.1875, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7031, -1.7422, 0.5547, 1.8359, -1.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4688, -4.3125, -2.0312, 1.8125, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6562, -3.1406, 0.2617, 1.0781, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:11:24,870] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.59 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-6.7812, -4.3750, 0.4473, -0.4180, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1250, -0.0302, 2.4219, -2.6875, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0938, -2.0938, 1.4688, 0.8477, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.6562, -4.5000, -0.4062, -1.3828, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:11:25,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:11:25,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.72 | bwd_microstep: 41.98 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 40.79 | step_microstep: 1.65 [2025-11-06 18:11:25,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.33 | bwd: 42.78 | bwd_inner: 1.84 | bwd_allreduce: 40.82 | step: 1.72 31%|███ | 1091/3507 [26:38<54:26, 1.35s/it] {'loss': 0.2532, 'learning_rate': 1.613848172327166e-05, 'epoch': 0.31} 31%|███ | 1091/3507 [26:38<54:26, 1.35s/it]tensor([[-2.9688, 0.0454, 3.3750, -1.0469, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1094, -0.8281, 1.8047, -0.7422, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.7188, -4.1250, -1.4766, 1.3672, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) [2025-11-06 18:11:25,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.26 | bwd_microstep: 1.11 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([3], device='cuda:3') 
tensor([[-5.0000, -4.1875, -1.1250, 1.6406, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7969, 0.1025, 2.6094, -1.7344, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4062, -3.5156, 0.6719, 0.9453, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9375, -2.9688, -0.7969, 0.0879, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7500, -3.5000, -0.1270, 1.6562, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:11:26,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 18:11:26,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.24 | bwd_microstep: 392.72 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 391.37 | step_microstep: 1.62 [2025-11-06 18:11:26,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 508.52 | bwd: 393.83 | bwd_inner: 2.29 | bwd_allreduce: 391.41 | step: 1.71 31%|███ | 1092/3507 [26:40<52:15, 1.30s/it] {'loss': 0.4267, 'learning_rate': 1.6131186988462835e-05, 'epoch': 0.31} 31%|███ | 1092/3507 [26:40<52:15, 1.30s/it]tensor([[-4.2188, -2.7812, 0.4492, 1.6484, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2188, -2.0156, 0.9844, 2.3750, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3750, -2.0469, 2.1562, 1.2891, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -2.1719, 1.6250, 1.1172, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7812, -1.8984, 2.2812, -1.0000, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:0')
[2025-11-06 18:11:26,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 319.66 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[interleaved per-rank debug prints of logit/label tensors (torch.bfloat16, cuda:0-3, grad_fn=<…>) and per-microstep fwd/bwd/optimizer timing lines omitted; one aggregate timing line and one progress line are kept per optimizer step]
[2025-11-06 18:11:27,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 609.80 | bwd: 703.22 | bwd_inner: 1.89 | bwd_allreduce: 701.20 | step: 2.37
31%|███ | 1093/3507 [26:41<53:01, 1.32s/it] {'loss': 0.9294, 'learning_rate': 1.6123887022096397e-05, 'epoch': 0.31}
[2025-11-06 18:11:29,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 254.26 | bwd: 1581.55 | bwd_inner: 1.59 | bwd_allreduce: 1579.82 | step: 2.00
31%|███ | 1094/3507 [26:43<59:37, 1.48s/it] {'loss': 0.6848, 'learning_rate': 1.6116581830401193e-05, 'epoch': 0.31}
[2025-11-06 18:11:30,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.84 | bwd: 1126.42 | bwd_inner: 2.03 | bwd_allreduce: 1124.26 | step: 2.33
31%|███ | 1095/3507 [26:44<59:21, 1.48s/it] {'loss': 0.313, 'learning_rate': 1.6109271419610526e-05, 'epoch': 0.31}
[2025-11-06 18:11:31,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.04 | bwd: 228.89 | bwd_inner: 2.01 | bwd_allreduce: 226.74 | step: 1.61
31%|███▏ | 1096/3507 [26:45<48:50, 1.22s/it] {'loss': 0.5518, 'learning_rate': 1.6101955795962142e-05, 'epoch': 0.31}
[2025-11-06 18:11:33,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 574.65 | bwd: 1613.54 | bwd_inner: 1.82 | bwd_allreduce: 1611.59 | step: 2.17
31%|███▏ | 1097/3507 [26:47<1:01:06, 1.52s/it] {'loss': 0.8516, 'learning_rate': 1.6094634965698248e-05, 'epoch': 0.31}
[2025-11-06 18:11:34,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.92 | bwd: 186.34 | bwd_inner: 1.90 | bwd_allreduce: 184.32 | step: 1.49
31%|███▏ | 1098/3507 [26:48<49:05, 1.22s/it] {'loss': 0.3035, 'learning_rate': 1.6087308935065488e-05, 'epoch': 0.31}
[2025-11-06 18:11:35,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 440.91 | bwd: 649.23 | bwd_inner: 2.26 | bwd_allreduce: 646.84 | step: 2.02
31%|███▏ | 1099/3507 [26:49<47:57, 1.19s/it] {'loss': 0.3815, 'learning_rate': 1.6079977710314944e-05, 'epoch': 0.31}
[2025-11-06 18:11:35,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.00 | bwd: 35.17 | bwd_inner: 2.37 | bwd_allreduce: 32.65 | step: 1.87
31%|███▏ | 1100/3507 [26:49<38:32, 1.04it/s] {'loss': 0.3176, 'learning_rate': 1.6072641297702128e-05, 'epoch': 0.31}
[2025-11-06 18:11:37,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.07 | bwd: 1695.14 | bwd_inner: 1.92 | bwd_allreduce: 1693.08 | step: 2.08
31%|███▏ | 1101/3507 [26:51<51:41, 1.29s/it] {'loss': 0.5852, 'learning_rate': 1.6065299703486986e-05, 'epoch': 0.31}
[2025-11-06 18:11:40,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 489.40 | bwd: 1625.55 | bwd_inner: 2.06 | bwd_allreduce: 1623.33 | step: 2.19
31%|███▏ | 1102/3507 [26:53<1:02:10, 1.55s/it] {'loss': 0.2636, 'learning_rate': 1.605795293393387e-05, 'epoch': 0.31}
[2025-11-06 18:11:40,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.77 | bwd: 2.51 | bwd_inner: 1.68 | bwd_allreduce: 0.66 | step: 1.59
31%|███▏ | 1103/3507 [26:54<47:43, 1.19s/it] {'loss': 0.1709, 'learning_rate': 1.6050600995311565e-05, 'epoch': 0.31}
[2025-11-06 18:11:41,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.54 | bwd: 847.29 | bwd_inner: 2.09 | bwd_allreduce: 845.06 | step: 1.70
31%|███▏ | 1104/3507 [26:55<48:35, 1.21s/it] {'loss': 0.1392, 'learning_rate': 1.6043243893893256e-05, 'epoch': 0.31}
[2025-11-06 18:11:43,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.99 | bwd: 1058.71 | bwd_inner: 1.88 | bwd_allreduce: 1056.70 | step: 3.13
32%|███▏ | 1105/3507 [26:56<51:23, 1.28s/it] {'loss': 0.7295, 'learning_rate': 1.603588163595654e-05, 'epoch': 0.32}
[2025-11-06 18:11:44,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.88 | bwd: 1319.26 | bwd_inner: 1.90 | bwd_allreduce: 1317.23 | step: 2.22
32%|███▏ | 1106/3507 [26:58<56:29, 1.41s/it] {'loss': 0.2931, 'learning_rate': 1.6028514227783408e-05, 'epoch': 0.32}
[2025-11-06 18:11:46,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.82 | bwd: 1382.46 | bwd_inner: 2.21 | bwd_allreduce: 1380.09 | step: 2.93
32%|███▏ | 1107/3507 [27:00<1:00:45, 1.52s/it] {'loss': 0.2194, 'learning_rate': 1.602114167566025e-05, 'epoch': 0.32}
[2025-11-06 18:11:47,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.01 | bwd: 480.22 | bwd_inner: 12.45 | bwd_allreduce: 467.61 | step: 2.29
32%|███▏ | 1108/3507 [27:01<53:46, 1.34s/it] {'loss': 0.5313, 'learning_rate': 1.601376398587784e-05, 'epoch': 0.32}
[2025-11-06 18:11:49,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.15 | bwd: 1077.83 | bwd_inner: 1.86 | bwd_allreduce: 1075.85 | step: 1.87
32%|███▏ | 1109/3507 [27:02<55:12, 1.38s/it] {'loss': 0.2633, 'learning_rate': 1.6006381164731338e-05, 'epoch': 0.32}
[2025-11-06 18:11:50,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.44 | bwd: 967.29 | bwd_inner: 1.79 | bwd_allreduce: 965.38 | step: 3.96
32%|███▏ | 1110/3507 [27:04<57:33, 1.44s/it] {'loss': 0.2593, 'learning_rate': 1.599899321852029e-05, 'epoch': 0.32}
[2025-11-06 18:11:52,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 443.50 | bwd: 971.55 | bwd_inner: 2.16 | bwd_allreduce: 969.27 | step: 35.02
32%|███▏ | 1111/3507 [27:05<58:23, 1.46s/it] {'loss': 0.2791, 'learning_rate': 1.5991600153548602e-05, 'epoch': 0.32}
[2025-11-06 18:11:53,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.78 | bwd: 3.06 | bwd_inner: 2.05 | bwd_allreduce: 0.87 | step: 2.44
32%|███▏ | 1112/3507 [27:06<51:37, 1.29s/it] {'loss': 0.6129, 'learning_rate': 1.5984201976124554e-05, 'epoch': 0.32}
[2025-11-06 18:11:53,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 269.63 | bwd: 3.28 | bwd_inner: 2.33 | bwd_allreduce: 0.83 | step: 2.56
32%|███▏ | 1113/3507 [27:07<47:13, 1.18s/it] {'loss': 0.405, 'learning_rate': 1.5976798692560796e-05, 'epoch': 0.32}
[2025-11-06 18:11:55,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.79 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
[2025-11-06 18:11:55,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.39 |
optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:11:55,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.26 | bwd_microstep: 1.86 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.84 | step_microstep: 4.04 [2025-11-06 18:11:55,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.07 | bwd: 2.76 | bwd_inner: 1.76 | bwd_allreduce: 0.88 | step: 4.13 32%|███▏ | 1114/3507 [27:09<55:11, 1.38s/it] {'loss': 0.3343, 'learning_rate': 1.596939030917432e-05, 'epoch': 0.32} 32%|███▏ | 1114/3507 [27:09<55:11, 1.38s/it]tensor([[-5.7812, -3.2656, 1.1406, -0.6328, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8594, -2.0312, 0.5938, 2.8125, -1.5547]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6719, -3.0781, -0.3359, 2.7031, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:11:56,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.34 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.8594, -2.0938, 1.1406, 0.9062, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.7109, 1.0391, 3.0781, -1.3594, -1.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9688, -2.5312, 0.5938, 1.5703, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7656, -2.4531, 0.7891, 2.1562, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8906, -2.5938, 0.6328, 1.9609, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:11:56,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.83 | optimizer_gradients: 0.17 | optimizer_step: 0.19 
[2025-11-06 18:11:56,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.77 | bwd_microstep: 53.67 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 52.63 | step_microstep: 5.11 [2025-11-06 18:11:56,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.13 | bwd: 54.48 | bwd_inner: 1.68 | bwd_allreduce: 52.66 | step: 5.19 32%|███▏ | 1115/3507 [27:10<44:58, 1.13s/it] {'loss': 0.2746, 'learning_rate': 1.5961976832286478e-05, 'epoch': 0.32} 32%|███▏ | 1115/3507 [27:10<44:58, 1.13s/it]tensor([[-4.2500, -3.2344, -0.0469, 1.8516, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9375, -3.3750, 0.2188, 0.9180, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.8125, 0.3887, 2.0000, -1.0938, -1.8047]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4688, -4.0938, -1.2578, 2.2031, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0938, -1.7891, 1.7188, 0.2061, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5156, -1.9609, 0.8594, 0.7227, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3750, -2.5000, -0.0977, 1.4922, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:11:59,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.34 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.0938, -4.4062, -1.1641, 1.9375, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:11:59,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.23 | optimizer_step: 0.18 [2025-11-06 18:11:59,989] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 252.87 | bwd_microstep: 1.89 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.73 | step_microstep: 2.29 [2025-11-06 18:11:59,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 431.22 | bwd: 2.69 | bwd_inner: 1.82 | bwd_allreduce: 0.76 | step: 2.36 32%|███▏ | 1116/3507 [27:13<1:15:21, 1.89s/it] {'loss': 1.6494, 'learning_rate': 1.5954558268222974e-05, 'epoch': 0.32} 32%|███▏ | 1116/3507 [27:13<1:15:21, 1.89s/it]tensor([[-3.2812, -3.2500, -1.2266, 2.7344, -1.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)tensor([[-7.1562, -6.2812, -2.9688, -0.8047, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([2], device='cuda:1') tensor([[-2.5312, -2.5781, -0.7969, 3.1094, -0.9297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:12:00,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.46 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-8.0000, -6.0938, -1.0703, -0.1104, -5.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -3.5781, 0.0211, 1.9219, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5938, -3.0781, 0.5273, 1.3281, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8125, 0.4648, 2.5312, -0.3027, -1.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.4062, -2.0156, 2.1719, 1.1328, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:12:00,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:12:00,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 146.18 | bwd_microstep: 40.40 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 39.44 | step_microstep: 1.41 [2025-11-06 18:12:00,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.67 | bwd: 41.42 | bwd_inner: 1.84 | bwd_allreduce: 39.47 | step: 1.48 32%|███▏ | 1117/3507 [27:14<57:11, 1.44s/it] {'loss': 1.2347, 'learning_rate': 1.5947134623313834e-05, 'epoch': 0.32} 32%|███▏ | 1117/3507 [27:14<57:11, 1.44s/it]tensor([[-4.0938, -3.2500, -0.2578, 2.4219, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:00,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.71 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.3125, -3.3906, 1.0391, 1.1719, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.3125, -4.4375, 0.2500, 0.7461, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5000, -2.0156, 1.7266, -0.4043, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0000, -3.1094, -0.1973, 1.7812, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9375, -0.3203, 2.3906, -1.2109, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1562, -2.3906, 1.1953, 1.2031, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -3.0000, 0.2988, 1.7812, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:12:01,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:12:01,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.97 | bwd_microstep: 
564.66 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 563.58 | step_microstep: 1.80 [2025-11-06 18:12:01,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 292.69 | bwd: 565.66 | bwd_inner: 1.91 | bwd_allreduce: 563.62 | step: 1.87 32%|███▏ | 1118/3507 [27:15<50:38, 1.27s/it] {'loss': 0.5798, 'learning_rate': 1.593970590389344e-05, 'epoch': 0.32} 32%|███▏ | 1118/3507 [27:15<50:38, 1.27s/it]tensor([[-3.3594, -1.3672, 1.5156, 0.4316, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2344, -0.6914, 2.0938, -1.1641, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:12:01,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.92 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.1250, -2.5156, 0.9727, -1.7969, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5938, -3.2656, 0.1328, 1.5156, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1562, -4.9062, -0.8750, 0.9805, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2188, -1.0625, 2.6094, 1.6719, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7188, -2.1719, 0.8750, 0.9805, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6562, -1.9844, 2.2969, -0.0752, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:12:01,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:12:01,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.50 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.93 | 
bwd_allreduce_microstep: 0.81 | step_microstep: 1.53 [2025-11-06 18:12:01,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 500.44 | bwd: 2.80 | bwd_inner: 1.83 | bwd_allreduce: 0.84 | step: 1.61 32%|███▏ | 1119/3507 [27:15<41:56, 1.05s/it] {'loss': 0.7597, 'learning_rate': 1.5932272116300493e-05, 'epoch': 0.32} 32%|███▏ | 1119/3507 [27:15<41:56, 1.05s/it]tensor([[-3.6719, -3.0938, -0.0422, 3.3438, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5625, -3.2344, -0.8594, 2.4531, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5312, -1.9688, 1.2188, 1.5469, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9688, 0.3555, 3.4375, -2.1406, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8438, -2.8750, 1.9375, -1.3047, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:12:02,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.35 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.1875, -1.8203, 2.3594, 0.9062, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.9375, -3.3906, 0.3652, 1.3594, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -3.1719, -0.1084, 2.1250, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:12:04,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 18:12:04,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.65 | bwd_microstep: 372.29 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 371.30 | 
step_microstep: 2.13 [2025-11-06 18:12:04,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.03 | bwd: 373.30 | bwd_inner: 1.81 | bwd_allreduce: 371.35 | step: 2.23 32%|███▏ | 1120/3507 [27:18<1:01:35, 1.55s/it] {'loss': 0.2209, 'learning_rate': 1.5924833266878015e-05, 'epoch': 0.32} 32%|███▏ | 1120/3507 [27:18<1:01:35, 1.55s/it]tensor([[-2.9375, -0.3984, 2.4844, -0.5078, -2.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5000, -2.4688, 1.1484, 0.2734, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7188, -3.2656, 0.5430, 1.7500, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:04,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.28 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.6094, -0.3613, 3.2969, -1.5547, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5312, -3.1875, 0.3516, 1.7578, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7344, -0.3242, 2.6094, 3.7656, -0.7930]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6875, -1.9141, 1.4219, 1.4297, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.4219, 0.8867, 2.4375, -0.8164, -1.5391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 18:12:04,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:12:04,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.53 | bwd_microstep: 49.39 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 48.22 | step_microstep: 1.52 [2025-11-06 
18:12:04,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.84 | bwd: 50.25 | bwd_inner: 1.87 | bwd_allreduce: 48.26 | step: 1.61 32%|███▏ | 1121/3507 [27:18<48:59, 1.23s/it] {'loss': 0.4812, 'learning_rate': 1.5917389361973365e-05, 'epoch': 0.32} 32%|███▏ | 1121/3507 [27:18<48:59, 1.23s/it]tensor([[-5.1250, -4.3750, -1.0703, 1.7812, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3438, -2.0156, 1.6172, -0.3691, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:12:05,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.72 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5938, -0.2988, 3.4062, -1.7344, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9375, -2.8750, 0.9531, 0.2793, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3438, -3.8906, -0.8828, 2.8125, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -3.9062, -0.5977, 1.3750, -3.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3438, -1.7344, 1.9766, -0.6875, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.7734, -1.8906, -0.2793, 3.8125, -0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:12:06,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:12:06,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.78 | bwd_microstep: 1.86 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.86 | step_microstep: 1.94 [2025-11-06 18:12:06,829] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.53 | bwd: 2.74 | bwd_inner: 1.73 | bwd_allreduce: 0.89 | step: 2.03 32%|███▏ | 1122/3507 [27:20<56:09, 1.41s/it] {'loss': 0.9929, 'learning_rate': 1.590994040793819e-05, 'epoch': 0.32} 32%|███▏ | 1122/3507 [27:20<56:09, 1.41s/it]tensor([[-2.7344, -1.5156, 0.7461, 0.9531, -1.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.6875, -7.4375, -3.1250, -1.4375, -6.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:07,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.16 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0625, -2.4531, 0.8438, 0.8125, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5781, -0.5234, 2.2969, 0.6953, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7031, -2.3281, 0.8672, 1.5781, -2.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6406, -0.6602, 2.8281, -1.6484, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3125, -3.5781, 0.8516, 1.6172, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1406, 0.4160, 2.5781, -0.7422, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') [2025-11-06 18:12:07,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:12:07,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 70.43 | bwd_microstep: 276.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 276.04 | step_microstep: 1.80 [2025-11-06 18:12:07,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd: 257.61 | bwd: 277.62 | bwd_inner: 1.43 | bwd_allreduce: 276.07 | step: 1.88 32%|███▏ | 1123/3507 [27:21<46:02, 1.16s/it] {'loss': 0.6285, 'learning_rate': 1.590248641112847e-05, 'epoch': 0.32} 32%|███▏ | 1123/3507 [27:21<46:02, 1.16s/it]tensor([[-3.0312, -0.1562, 2.6406, -1.5234, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -2.8281, 0.6797, 2.5469, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:07,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.17 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.7812, -4.1875, 0.0630, 1.1250, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4062, -1.5703, 2.4062, -0.7070, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.9609, -0.2988, 1.7500, 0.9062, -1.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0625, -3.3125, -0.0713, 2.7656, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7812, -2.0625, 2.0469, -0.5820, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.3750, -3.5781, -0.4980, 2.0781, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:12:10,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:12:10,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 134.41 | bwd_microstep: 2296.10 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 2295.06 | step_microstep: 2.05 [2025-11-06 18:12:10,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.59 | bwd: 2296.87 | bwd_inner: 
1.63 | bwd_allreduce: 2295.10 | step: 2.13 32%|███▏ | 1124/3507 [27:24<1:08:20, 1.72s/it] {'loss': 0.4043, 'learning_rate': 1.5895027377904468e-05, 'epoch': 0.32} 32%|███▏ | 1124/3507 [27:24<1:08:20, 1.72s/it]tensor([[-4.5000, -1.4922, 3.0000, -0.5625, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1719, -0.0806, 2.5938, -2.0469, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:12:10,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.45 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.3438, -5.0625, -2.1250, 1.7266, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.5078, 1.1328, 2.6406, -1.6406, -1.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3438, -2.2969, 0.8281, -0.9102, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2188, -3.0625, 0.4902, -1.0781, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.4375, 0.3633, 2.9062, -1.1875, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5625, -3.1719, 0.4609, 1.7812, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:12:10,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:12:10,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.96 | bwd_microstep: 178.36 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 177.27 | step_microstep: 1.60 [2025-11-06 18:12:10,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.43 | bwd: 179.22 | bwd_inner: 1.78 | bwd_allreduce: 177.31 | 
step: 1.69 32%|███▏ | 1125/3507 [27:24<54:24, 1.37s/it] {'loss': 0.1464, 'learning_rate': 1.5887563314630753e-05, 'epoch': 0.32} 32%|███▏ | 1125/3507 [27:24<54:24, 1.37s/it]tensor([[-3.0156, -0.5078, 2.8281, 0.3574, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7656, -1.6016, 1.3828, -0.3281, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7188, -1.8281, 1.4453, 0.8281, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-9.5625, -6.4062, -0.9023, -4.4062, -8.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7812, -0.8555, 2.4531, -1.5391, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:12:11,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 320.21 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.9297, 0.6719, 2.6562, -0.9844, -1.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.5469, -0.5469, 2.7500, -1.4375, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.3613, 2.3438, 4.8438, 1.1484, -0.6680]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:12:13,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.79 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:12:13,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.17 | bwd_microstep: 1708.15 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1706.92 | step_microstep: 2.45 [2025-11-06 18:12:13,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 477.41 | bwd: 1709.06 | bwd_inner: 1.97 | bwd_allreduce: 1706.96 | step: 2.53 32%|███▏ | 1126/3507 
[27:27<1:04:35, 1.63s/it] {'loss': 0.4032, 'learning_rate': 1.5880094227676192e-05, 'epoch': 0.32} 32%|███▏ | 1126/3507 [27:27<1:04:35, 1.63s/it]tensor([[-2.4688, -2.3906, -0.5469, 3.1562, -0.9492]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0938, -4.4688, -1.5234, 0.7695, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:13,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.95 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06 tensor([[-6.3125, -3.6250, 1.3906, -0.4414, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -3.1406, 0.0101, 1.7891, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3125, -1.5703, 2.0625, -0.9727, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.7266, -1.8750, -0.5938, 3.0938, -0.3301]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.6250, -4.5625, 0.6211, -2.3906, -6.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5312, -2.0625, 0.1270, 2.7344, -1.1953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') [2025-11-06 18:12:13,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:12:13,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.65 | bwd_microstep: 39.66 | bwd_inner_microstep: 1.32 | bwd_allreduce_microstep: 38.26 | step_microstep: 1.43 [2025-11-06 18:12:13,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.62 | bwd: 40.50 | bwd_inner: 2.09 | bwd_allreduce: 38.29 | step: 1.49 32%|███▏ | 1127/3507 [27:27<49:58, 1.26s/it] {'loss': 0.967, 
'learning_rate': 1.587262012341393e-05, 'epoch': 0.32}
 32%|███▏ | 1127/3507 [27:27<49:58, 1.26s/it]
tensor([[-3.6562, -1.3438, 1.2109, -1.5000, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:13,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.41 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.2812, -3.1094, 1.3906, 0.5742, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8438, -3.8594, -1.8672, 1.6797, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.1875, -4.4062, 1.0000, -1.0938, -5.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.1875, -3.0781, 0.0171, 1.5000, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.4844, -2.4219, 0.7227, 2.5469, -2.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5938, -2.4844, 0.4531, 1.6328, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7500, -2.7656, 0.1543, 1.9766, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:12:15,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.26
[2025-11-06 18:12:15,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.19 | bwd_microstep: 1760.28 | bwd_inner_microstep: 1.59 | bwd_allreduce_microstep: 1758.60 | step_microstep: 1.90
[2025-11-06 18:12:15,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.62 | bwd: 1761.32 | bwd_inner: 2.54 | bwd_allreduce: 1758.65 | step: 1.99
 32%|███▏ | 1128/3507 [27:29<1:00:06, 1.52s/it]
{'loss': 0.4589, 'learning_rate': 1.5865141008221394e-05, 'epoch': 0.32}

tensor([[-4.2812, -2.2500, 1.6797, 1.3281, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0000, -2.5938, 1.7578, 0.2891, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7344, -1.0938, 2.3594, -0.7188, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[0.2715, 2.3750, 4.1562, 2.4531, 0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:15,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.88 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12
tensor([[-3.0781, -2.3125, 0.2715, 2.1406, -1.7891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8438, -3.6719, -1.4375, 1.9453, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4062, -1.3047, 2.0000, 1.0469, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.6562, -3.7344, 0.8711, 1.1562, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:16,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:12:16,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.38 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.88 | step_microstep: 1.71
[2025-11-06 18:12:16,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 511.30 | bwd: 3.14 | bwd_inner: 2.02 | bwd_allreduce: 0.94 | step: 1.83
 32%|███▏ | 1129/3507 [27:30<48:43, 1.23s/it]
{'loss': 0.373, 'learning_rate': 1.5857656888480287e-05, 'epoch': 0.32}

tensor([[-4.2812, -2.7969, 0.6406, 1.0156, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2969, -2.1250, 0.8672, 2.2344, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:16,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.65 | bwd_microstep: 1.19 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.6875, -3.0781, 0.7461, 1.2344, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7188, -0.7891, 2.7031, -1.0391, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.6875, -5.3125, -2.3281, 1.2109, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.0000, 0.0327, 3.6094, -0.0952, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.1250, -4.2500, -0.7969, 1.5859, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5000, -3.2031, 0.0938, 1.2422, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:12:18,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.18 | optimizer_step: 0.21
[2025-11-06 18:12:18,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.69 | bwd_microstep: 2252.02 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 2250.82 | step_microstep: 2.30
[2025-11-06 18:12:18,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.38 | bwd: 2253.21 | bwd_inner: 2.21 | bwd_allreduce: 2250.87 | step: 2.39
 32%|███▏ | 1130/3507 [27:32<1:05:49, 1.66s/it]
{'loss': 0.2775, 'learning_rate': 1.585016777057659e-05, 'epoch': 0.32}

tensor([[-5.0000, -2.7812, 0.7500, -1.1016, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.7031, -0.1030, 2.2656, -1.7188, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.1250, -3.7500, 1.1562, 0.0752, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:19,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.55 | bwd_microstep: 1.22 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.3438, -4.9062, -1.7891, 1.5938, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.7891, 0.8828, 2.8750, -1.4609, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.0781, -0.8867, 1.7344, 2.8750, -1.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8750, -4.0312, -0.5117, 2.1875, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.7812, -3.6094, -0.0820, 1.5938, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:12:19,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.14 | optimizer_step: 0.18
[2025-11-06 18:12:19,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.58 | bwd_microstep: 22.38 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 21.24 | step_microstep: 2.12
[2025-11-06 18:12:19,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.16 | bwd: 23.60 | bwd_inner: 2.20 | bwd_allreduce: 21.27 | step: 2.21
 32%|███▏ | 1131/3507 [27:33<51:12, 1.29s/it]
{'loss': 0.3065, 'learning_rate': 1.5842673660900536e-05, 'epoch': 0.32}

tensor([[-4.2812, -2.5625, 1.1328, 1.2422, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.1089, 1.1641, 3.2812, 4.1562, 0.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7969, -1.5078, 2.0781, 0.2197, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:19,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.17 | bwd_microstep: 1.20 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.5312, -2.0000, 0.4648, 3.4062, -1.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9375, -3.0781, 1.3203, 2.1094, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5938, -3.1094, 0.8164, -1.2500, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.7500, -3.8906, 0.5352, 0.9297, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7500, -3.8281, -0.3691, 2.1719, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:12:20,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:12:20,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.68 | bwd_microstep: 338.02 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 336.77 | step_microstep: 1.50
[2025-11-06 18:12:20,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.87 | bwd: 339.21 | bwd_inner: 2.29 | bwd_allreduce: 336.80 | step: 1.58
 32%|███▏ | 1132/3507 [27:33<44:12, 1.12s/it]
{'loss': 0.3055, 'learning_rate': 1.5835174565846624e-05, 'epoch': 0.32}

tensor([[-3.2031, 0.1465, 3.2500, -2.6562, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2500, -2.3906, 1.3672, 1.1406, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:20,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.92 | bwd_microstep: 1.11 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.5625, -3.0781, 0.6875, 1.7109, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.9688, -4.2500, 0.6680, -1.3281, -5.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2969, -0.8438, 2.2500, -0.3613, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.5938, -2.8906, -0.4297, 1.7578, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9688, -1.7266, 1.8047, 0.2676, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.1250, -2.9688, 0.4648, 2.3281, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:12:22,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:12:22,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.58 | bwd_microstep: 2055.68 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 2054.68 | step_microstep: 1.65
[2025-11-06 18:12:22,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.53 | bwd: 2056.79 | bwd_inner: 1.95 | bwd_allreduce: 2054.72 | step: 1.72
 32%|███▏ | 1133/3507 [27:36<1:02:11, 1.57s/it]
{'loss': 0.3475, 'learning_rate': 1.582767049181361e-05, 'epoch': 0.32}

tensor([[-7.6250, -5.6562, -1.0625, -1.4609, -5.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6562, -2.8125, 0.7891, 0.0544, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0938, -3.3438, -0.4199, 2.1875, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:22,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.98 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.8438, -3.2969, 0.6523, -1.9141, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4688, -1.5859, 2.5469, -0.8516, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.4062, -2.4688, 2.3281, -0.7930, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.2500, -2.9844, 0.5781, 2.3750, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.9219, 0.9102, 3.3594, -1.0703, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:12:23,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:12:23,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.92 | bwd_microstep: 168.16 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 167.01 | step_microstep: 1.45
[2025-11-06 18:12:23,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.93 | bwd: 169.20 | bwd_inner: 2.01 | bwd_allreduce: 167.05 | step: 1.53
 32%|███▏ | 1134/3507 [27:37<49:47, 1.26s/it]
{'loss': 0.2375, 'learning_rate': 1.582016144520449e-05, 'epoch': 0.32}

tensor([[-5.0312, -3.7031, -0.0454, 1.1016, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.3125, -2.8594, -0.1523, 3.1875, -1.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7500, -3.8125, -0.3828, 2.0938, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:23,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.38 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.7812, -4.3125, -0.3613, 0.3027, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8438, -3.5625, -1.5312, 1.2344, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0156, -2.6250, -0.3164, 2.8438, -1.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.2188, 0.5195, 2.8750, -1.5547, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.6250, -3.1719, 0.5430, 1.8281, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:12:24,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:12:24,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.61 | bwd_microstep: 1165.41 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1164.22 | step_microstep: 1.94
[2025-11-06 18:12:24,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.00 | bwd: 1166.24 | bwd_inner: 1.86 | bwd_allreduce: 1164.25 | step: 2.01
 32%|███▏ | 1135/3507 [27:38<53:06, 1.34s/it]
{'loss': 0.3172, 'learning_rate': 1.5812647432426512e-05, 'epoch': 0.32}

tensor([[-2.7031, 0.5156, 3.3125, -1.7109, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2500, -1.4141, 2.6875, -0.3945, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9375, -4.2812, -1.1406, 1.5156, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.5938, -5.5312, -0.2148, 0.0791, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:25,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.36 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-4.8125, -4.1562, -0.8203, 2.5469, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.4375, -2.6562, 0.2031, 2.4062, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.6562, -4.2188, 0.0452, 1.6016, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0938, -4.1250, -0.3613, 2.3906, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:25,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.15 | optimizer_step: 0.20
[2025-11-06 18:12:25,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.35 | bwd_microstep: 1.86 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.86 | step_microstep: 1.64
[2025-11-06 18:12:25,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.75 | bwd: 2.87 | bwd_inner: 1.83 | bwd_allreduce: 0.90 | step: 1.75
 32%|███▏ | 1136/3507 [27:39<41:52, 1.06s/it]
{'loss': 0.1875, 'learning_rate': 1.5805128459891154e-05, 'epoch': 0.32}

tensor([[-4.3125, -2.7188, 0.8438, 1.3281, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2500, -3.3281, 0.7852, 0.7891, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:25,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.51 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-3.4375, -1.0547, 2.6406, 0.7188, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0938, -3.0312, 0.0066, 1.3984, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6406, -1.0312, 2.2969, -0.9805, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8438, -2.7031, 1.0156, -0.3242, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3750, -2.2344, 1.7500, 0.9062, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.8438, -1.8906, 1.3047, 0.4453, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:12:28,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.21
[2025-11-06 18:12:28,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.85 | bwd_microstep: 2082.24 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 2080.82 | step_microstep: 2.20
[2025-11-06 18:12:28,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.37 | bwd: 2083.09 | bwd_inner: 2.11 | bwd_allreduce: 2080.85 | step: 2.27
 32%|███▏ | 1137/3507 [27:42<1:04:41, 1.64s/it]
{'loss': 0.4489, 'learning_rate': 1.5797604534014134e-05, 'epoch': 0.32}

tensor([[-3.8438, -3.2031, -0.5039, 2.0312, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.2812, -4.4062, -1.0938, 1.0391, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.1094, -0.7539, 1.9531, 2.7656, -1.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:28,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.39 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.1250, -1.7734, 1.6719, -0.5820, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9219, -2.9688, 0.5195, 3.1719, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1250, -2.7500, 1.4297, -0.3965, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.2500, -2.0469, 0.6992, 1.4062, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.7500e+00, -1.7624e-03, 3.1562e+00, -6.2109e-01, -2.6094e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:12:28,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.11 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:12:28,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.21 | bwd_microstep: 200.08 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 198.91 | step_microstep: 2.98
[2025-11-06 18:12:28,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.63 | bwd: 201.00 | bwd_inner: 1.91 | bwd_allreduce: 198.95 | step: 3.07
 32%|███▏ | 1138/3507 [27:42<52:49, 1.34s/it]
{'loss': 0.4187, 'learning_rate': 1.5790075661215384e-05, 'epoch': 0.32}

tensor([[-2.9688, -1.0078, 1.9375, 0.6016, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:28,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.23 | bwd_microstep: 1.12 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.2812, -3.5312, 0.5156, 0.7773, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0312, -3.7188, 0.2773, 1.9922, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.7656, -0.4531, 3.1562, -2.1094, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2500, -2.9062, 1.5625, 0.1699, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5312, -3.8281, -0.7305, 1.9453, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1562, -2.6719, 1.1875, 2.6406, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.7344, -2.7812, -1.5938, 1.2266, -1.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
[2025-11-06 18:12:29,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:12:29,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.46 | bwd_microstep: 196.00 | bwd_inner_microstep: 1.43 | bwd_allreduce_microstep: 194.49 | step_microstep: 1.81
[2025-11-06 18:12:29,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 259.71 | bwd: 197.11 | bwd_inner: 2.46 | bwd_allreduce: 194.53 | step: 1.89
 32%|███▏ | 1139/3507 [27:43<42:48, 1.08s/it]
{'loss': 1.0744, 'learning_rate': 1.5782541847919075e-05, 'epoch': 0.32}

tensor([[-5.2500, -2.0625, 3.0781, -0.1357, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.1406, -3.0625, -1.6094, 1.1875, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.4688, -5.5625, -1.4688, 1.3516, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:29,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.64 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.9688, -0.5156, 2.7656, 0.2178, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.5156, -3.3438, -1.0234, 2.5469, -1.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.6562, -2.8281, 2.3438, 0.1055, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.6719, 1.2188, 3.0469, -1.7422, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.3906, -2.4688, -1.1250, 2.2031, -0.9141]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:12:31,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:12:31,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.10 | bwd_microstep: 1497.98 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 1496.65 | step_microstep: 1.95
[2025-11-06 18:12:31,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.77 | bwd: 1498.83 | bwd_inner: 2.02 | bwd_allreduce: 1496.68 | step: 2.02
 33%|███▎ | 1140/3507 [27:44<52:01, 1.32s/it]
{'loss': 0.0976, 'learning_rate': 1.5775003100553577e-05, 'epoch': 0.33}

tensor([[-2.0781, 1.1250, 3.5312, -1.7734, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0625, -2.5938, 2.1250, 0.6445, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0938, -4.2188, 0.5391, 1.0547, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.4062, -3.7656, 1.2031, -0.8906, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:31,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.72 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.8125, -0.7383, 1.9922, -0.0566, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.6719, -1.0234, 2.4688, -0.9297, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-3.2344, -0.4531, 2.6719, -1.0469, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.1094, -1.0938, 1.6719, 3.5469, -0.9570]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:12:31,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.78 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 18:12:31,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 60.65 | bwd_microstep: 284.58 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 283.54 | step_microstep: 2.53
[2025-11-06 18:12:31,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.38 | bwd: 285.55 | bwd_inner: 1.85 | bwd_allreduce: 283.58 | step: 2.61
 33%|███▎ | 1141/3507 [27:45<44:33, 1.13s/it]
{'loss': 0.8746, 'learning_rate': 1.576745942555148e-05, 'epoch': 0.33}

tensor([[-4.8125, -3.8750, -0.3457, 2.2812, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2500, -3.1562, 1.2578, 0.7109, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:32,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 252.96 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-2.3906, 0.3926, 2.9062, -0.9531, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-3.1562, -0.7148, 2.3125, -0.1738, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9688, -3.7188, -0.0109, 1.5000, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8750, -0.9531, 2.8594, -0.8125, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9375, -1.8047, 1.5156, -0.2871, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2500, -1.0078, 2.4531, -2.5781, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:33,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.88 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:12:33,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.76 | bwd_microstep: 2.45 | bwd_inner_microstep: 1.55 | bwd_allreduce_microstep: 0.81 | step_microstep: 2.73
[2025-11-06 18:12:33,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.72 | bwd: 3.37 | bwd_inner: 2.41 | bwd_allreduce: 0.84 | step: 2.80
 33%|███▎ | 1142/3507 [27:47<51:20, 1.30s/it]
{'loss': 0.4751, 'learning_rate': 1.5759910829349568e-05, 'epoch': 0.33}

tensor([[-4.8125, -3.4062, 0.2393, 1.2344, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3750, -3.2500, 0.2109, 1.8594, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8906, -1.6094, 1.6562, -0.6328, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6875, -3.6094, 0.1221, 2.5312, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:33,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.31 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-3.7500, -2.0938, 1.4609, 1.8672, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5938, -1.9375, 2.3125, 0.0109, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.4531, 0.9375, 2.3594, -1.2891, -1.6484]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7969, -2.8125, -0.1582, 0.6758, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:34,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.09 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:12:34,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.89 | bwd_microstep: 1.82 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.85 | step_microstep: 3.67
[2025-11-06 18:12:34,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 441.22 | bwd: 2.72 | bwd_inner: 1.72 | bwd_allreduce: 0.88 | step: 3.74
 33%|███▎ | 1143/3507 [27:47<41:45, 1.06s/it]
{'loss': 0.4353, 'learning_rate': 1.575235731838884e-05, 'epoch': 0.33}

tensor([[-2.9688, -2.6719, -0.8086, 1.8047, -1.5703]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1562, -3.7969, -1.3984, 1.5547, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5781, -3.0312, -0.7305, 1.8672, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:12:34,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.91 | bwd_microstep: 1.15 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.7812, -4.2500, -1.3750, 1.5547, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6875, -2.2344, 1.2891, -1.2422, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8125, -4.0938, -1.0078, 1.3672, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.3125, -2.3594, 1.9922, -1.6250, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.6875, -0.3301, 2.4219, -0.1177, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:12:35,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:12:35,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.81 | bwd_microstep: 644.42 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 643.06 | step_microstep: 1.77
[2025-11-06 18:12:35,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.74 | bwd: 645.55 | bwd_inner: 2.30 | bwd_allreduce: 643.09 | step: 1.84
 33%|███▎ | 1144/3507 [27:49<47:14, 1.20s/it]
{'loss': 1.0333, 'learning_rate': 1.5744798899114476e-05, 'epoch': 0.33}

tensor([[-5.2812, -4.7812, -1.6250, 1.5078, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.7344, 0.9805, 3.2188, -0.7773, -1.8984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:35,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.60 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.8125, -3.6719, -0.2021, 1.1406, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6250, -2.1719, 0.7695, 0.8203, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7344, -1.8516, 0.8047, 3.0312, -1.3984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.4844, -0.5781, 2.4062, -1.8984, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.9375, -4.3125, 0.0801, -2.5781, -5.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.4688, -5.0000, -1.9609, 1.2031, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:12:36,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.79 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:12:36,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.97 | bwd_microstep: 164.63 | bwd_inner_microstep: 1.41 | bwd_allreduce_microstep: 163.12 | step_microstep: 2.37
[2025-11-06 18:12:36,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.59 | bwd: 165.59 | bwd_inner: 2.27 | bwd_allreduce: 163.16 | step: 2.46
 33%|███▎ | 1145/3507 [27:50<44:01, 1.12s/it]
{'loss': 0.1906, 'learning_rate': 1.573723557797585e-05, 'epoch': 0.33}

tensor([[-4.3438, -3.2500, 0.2363, 2.0469, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6562, -1.6016, 1.3750, 3.3125, -1.3672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:36,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.27 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-7.1562, -4.8438, 0.2275, -0.2051, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.4688, -2.5625, 2.3750, -0.2412, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5000, -5.5312, -1.4688, 1.0312, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.9062, -5.3750, -0.7617, 0.5664, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5938, -1.1250, 2.1094, -0.9531, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-5.6562, -2.9688, 1.9453, -0.1416, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:12:38,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.20 | optimizer_step: 0.27
[2025-11-06 18:12:38,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.74 | bwd_microstep: 649.39 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 648.21 | step_microstep: 2.41
[2025-11-06 18:12:38,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.95 | bwd: 650.29 | bwd_inner: 1.89 | bwd_allreduce: 648.25 | step: 2.50
 33%|███▎ | 1146/3507 [27:51<49:47, 1.27s/it]
{'loss': 0.8117, 'learning_rate': 1.572966736142651e-05, 'epoch': 0.33}

tensor([[-1.2734, 1.4375, 4.1250, 0.3008, -1.4141]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1562, -2.5156, 0.7227, 0.6797, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8125, -3.3594, 0.7070, 1.9766, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:38,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.41 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-2.4844, -1.0391, 1.7109, 1.9219, -1.6016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.9844, -3.1875, -1.7656, 1.9219, -1.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-3.4375, -3.1719, -0.9062, 2.7500, -1.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5000, -4.8438, -1.3125, 1.7422, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2188, -2.4844, 1.1875, 1.4922, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:39,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.83 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:12:39,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.70 | bwd_microstep: 1.79 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.65 | step_microstep: 2.69
[2025-11-06 18:12:39,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.09 | bwd: 2.71 | bwd_inner: 1.90 | bwd_allreduce: 0.69 | step: 2.78
 33%|███▎ | 1147/3507 [27:53<50:17, 1.28s/it]
{'loss': 0.7816, 'learning_rate': 1.5722094255924198e-05, 'epoch': 0.33}

tensor([[-4.1562, -3.6719, -0.7812, 2.5625, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6562, -2.7656, 0.8398, 0.2578, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:39,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.38 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.6562, -3.7812, 0.0889, -0.4004, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.2344, 0.3770, 2.3438, -1.2812, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8125, -2.6562, 1.3672, 0.0747, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.0000, -1.9062, 1.5547, 0.3555, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9062, -2.6250, 0.8477, 2.4219, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5000, -1.4141, 1.0469, -0.9062, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:12:40,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.15 | optimizer_step: 0.20
[2025-11-06 18:12:40,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.37 | bwd_microstep: 3.90 | bwd_inner_microstep: 2.76 | bwd_allreduce_microstep: 0.97 | step_microstep: 2.61
[2025-11-06 18:12:40,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.81 | bwd: 4.70 | bwd_inner: 3.50 | bwd_allreduce: 0.99 | step: 2.69
 33%|███▎ | 1148/3507 [27:54<51:05, 1.30s/it]
{'loss': 0.2725, 'learning_rate': 1.571451626793081e-05, 'epoch': 0.33}

tensor([[-5.8750, -5.2812, -1.6562, 1.9219, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2188, -0.0187, 3.6406, -0.9570, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.7812, -4.2500, -1.1016, 1.9922, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:12:41,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.47 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.0781, -2.5312,
-0.2812, 1.7031, -1.7734]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5312, -3.1406, 0.4902, 1.6797, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -2.7500, 1.3594, -0.1777, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.7812, -2.2656, 2.2969, 0.5938, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.3750, -3.3906, 1.7500, -1.3125, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:12:42,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.18 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 18:12:42,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.42 | bwd_microstep: 1299.69 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1298.57 | step_microstep: 3.22 [2025-11-06 18:12:42,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.90 | bwd: 1300.56 | bwd_inner: 1.81 | bwd_allreduce: 1298.61 | step: 3.30 33%|███▎ | 1149/3507 [27:56<58:23, 1.49s/it] {'loss': 0.1278, 'learning_rate': 1.5706933403912415e-05, 'epoch': 0.33} 33%|███▎ | 1149/3507 [27:56<58:23, 1.49s/it]tensor([[-4.8125, -4.2500, -0.9961, 2.3750, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:42,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.51 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.8906, -1.0078, 2.5312, -1.5000, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8906, -2.5781, -0.4219, 3.0781, -1.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0938, -2.2656, 2.1719, -0.8008, -4.2500]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1250, -3.5625, -0.8516, 1.8281, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.8828, 0.6172, 3.7969, 1.0547, -1.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5469, -0.4434, 3.0000, -1.4766, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3438, -2.8594, 0.6250, 1.3203, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:44,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.22 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:12:44,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.45 | bwd_microstep: 1.90 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 3.57 [2025-11-06 18:12:44,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.95 | bwd: 2.85 | bwd_inner: 1.95 | bwd_allreduce: 0.78 | step: 3.66 33%|███▎ | 1150/3507 [27:58<59:06, 1.50s/it] {'loss': 0.8214, 'learning_rate': 1.569934567033925e-05, 'epoch': 0.33} 33%|███▎ | 1150/3507 [27:58<59:06, 1.50s/it]tensor([[-4.2500, -4.2188, -1.9453, 2.0000, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6406, -1.0859, 2.0781, -1.0391, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, -3.3125, 0.7773, 0.8320, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:12:44,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.71 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.8750, -3.6719, 0.0571, 2.0312, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:1') tensor([[-4.7188, -1.5078, 2.1094, -2.8438, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.7500, -5.0312, -1.3438, 1.6562, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5625, -4.3750, -0.2871, 1.9062, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5312, -3.7500, -0.3633, 2.5469, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:12:45,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 2.66 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:12:45,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.72 | bwd_microstep: 1041.30 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 1040.06 | step_microstep: 5.03 [2025-11-06 18:12:45,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.46 | bwd: 1042.08 | bwd_inner: 1.84 | bwd_allreduce: 1040.10 | step: 5.10 33%|███▎ | 1151/3507 [27:59<58:20, 1.49s/it] {'loss': 0.1579, 'learning_rate': 1.5691753073685692e-05, 'epoch': 0.33} 33%|███▎ | 1151/3507 [27:59<58:20, 1.49s/it]tensor([[-4.4375, -1.9531, 1.3594, -1.8281, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2188, -2.4844, 0.4629, 3.2656, -1.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:45,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.01 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.3125, -2.4844, -0.1738, -1.6250, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1719, -0.8594, 1.1250, -2.1406, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') 
tensor([[-4.4062, -3.4219, -0.0369, 1.7656, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4688, -1.3438, 2.8281, -1.5000, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5625, -4.5312, -0.9219, 1.3984, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.2031, -0.0703, 2.8750, 1.6797, -1.6172]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:12:47,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:12:47,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.77 | bwd_microstep: 147.82 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 146.64 | step_microstep: 1.84 [2025-11-06 18:12:47,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.81 | bwd: 148.64 | bwd_inner: 1.84 | bwd_allreduce: 146.67 | step: 1.92 33%|███▎ | 1152/3507 [28:00<56:56, 1.45s/it] {'loss': 0.4085, 'learning_rate': 1.568415562043028e-05, 'epoch': 0.33} 33%|███▎ | 1152/3507 [28:00<56:56, 1.45s/it]tensor([[-4.9062, -1.7734, 2.6875, -1.2734, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1875, -2.8906, 0.3496, 1.5859, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:47,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.98 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.7188, -3.4688, 0.1128, 1.7344, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2500, -2.5938, 1.1094, 1.3906, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5938, -2.2812, 1.2734, 
-0.5547, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9688, -3.0000, 1.9375, -1.1250, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2188, -4.2812, -0.7109, 1.7656, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7656, -1.1172, 2.4375, -0.1670, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:12:48,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:12:48,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.84 | bwd_microstep: 1308.78 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1307.64 | step_microstep: 1.86 [2025-11-06 18:12:48,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 312.84 | bwd: 1309.79 | bwd_inner: 1.99 | bwd_allreduce: 1307.68 | step: 1.94 33%|███▎ | 1153/3507 [28:02<1:00:02, 1.53s/it] {'loss': 0.2239, 'learning_rate': 1.5676553317055694e-05, 'epoch': 0.33} 33%|███▎ | 1153/3507 [28:02<1:00:02, 1.53s/it]tensor([[-4.3125, -2.4062, 1.2812, 0.6328, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4062, -4.0625, -0.3398, 0.7695, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5625, -4.0938, -1.0156, 2.3750, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:48,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.65 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.2500, -1.6094, 1.6875, 2.0938, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.1094, 1.0469, 3.3594, -2.2344, -2.5156]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5938, -3.1719, -0.5234, 2.5469, -1.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7188, -4.6562, -0.9727, 1.0703, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4375, -2.7812, -0.1650, 2.2812, -1.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:50,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:12:50,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.46 | bwd_microstep: 1.94 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.77 | step_microstep: 1.95 [2025-11-06 18:12:50,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.13 | bwd: 2.87 | bwd_inner: 1.94 | bwd_allreduce: 0.80 | step: 2.03 33%|███▎ | 1154/3507 [28:04<59:03, 1.51s/it] {'loss': 0.3476, 'learning_rate': 1.5668946170048746e-05, 'epoch': 0.33} 33%|███▎ | 1154/3507 [28:04<59:03, 1.51s/it]tensor([[-4.5000, -2.7188, 0.7695, 0.4492, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0938, -3.8438, -0.3555, 0.9336, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:50,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.70 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.4375, -3.7500, -0.1387, -3.4688, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1562, -2.4531, 2.2344, 0.0850, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5469, -1.4453, 2.3906, 1.6484, -2.5625]], device='cuda:2', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7188, -1.8203, 1.5469, 0.7383, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9219, -0.0173, 2.7812, -1.7969, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.0000, -3.5938, 0.1260, 1.3594, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:50,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:12:50,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.53 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.61 [2025-11-06 18:12:50,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.25 | bwd: 3.01 | bwd_inner: 2.01 | bwd_allreduce: 0.87 | step: 1.69 33%|███▎ | 1155/3507 [28:04<48:23, 1.23s/it] {'loss': 0.7637, 'learning_rate': 1.566133418590039e-05, 'epoch': 0.33} 33%|███▎ | 1155/3507 [28:04<48:23, 1.23s/it]tensor([[-2.7812, 0.3496, 3.3438, -1.4531, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.6250, -6.6562, -1.4844, -1.3516, -6.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0312, -4.0938, -0.5391, 1.8594, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:51,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.47 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.5000, -2.4844, 2.2656, -1.2891, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5625, -4.1875, 0.0640, 1.7578, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-4.2500, -1.8203, 2.2344, 0.7188, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8125, -0.2793, 2.3594, -1.1406, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2188, -3.4219, -0.5938, 1.7031, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:52,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.18 | optimizer_step: 0.25 [2025-11-06 18:12:52,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.69 | bwd_microstep: 2.39 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 1.06 | step_microstep: 1.94 [2025-11-06 18:12:52,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.18 | bwd: 3.36 | bwd_inner: 2.09 | bwd_allreduce: 1.10 | step: 2.03 33%|███▎ | 1156/3507 [28:06<55:27, 1.42s/it] {'loss': 0.1809, 'learning_rate': 1.5653717371105702e-05, 'epoch': 0.33} 33%|███▎ | 1156/3507 [28:06<55:27, 1.42s/it]tensor([[-5.3438, -4.0312, -0.2988, 1.0625, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7188, -3.3750, 0.1167, 1.1875, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:52,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.80 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.7969, -0.9023, 2.6875, -0.9297, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, -1.5859, 2.9062, -0.3555, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -3.4062, 0.5039, 2.1094, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -1.0625, 3.3125, -2.2656, 
-4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8125, -3.2656, 1.5938, -0.5039, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8750, -4.4688, -0.4258, 0.9883, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:53,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.80 | optimizer_gradients: 0.16 | optimizer_step: 0.22 [2025-11-06 18:12:53,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.42 | bwd_microstep: 2.14 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.94 | step_microstep: 2.87 [2025-11-06 18:12:53,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 345.24 | bwd: 3.09 | bwd_inner: 1.99 | bwd_allreduce: 0.98 | step: 2.96 33%|███▎ | 1157/3507 [28:07<53:44, 1.37s/it] {'loss': 0.4144, 'learning_rate': 1.564609573216388e-05, 'epoch': 0.33} 33%|███▎ | 1157/3507 [28:07<53:44, 1.37s/it]tensor([[-5.0312, -2.3594, 1.2344, -1.8125, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.9062, -0.1143, 3.1719, 3.0625, -1.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -3.4844, -0.0488, 1.0391, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:54,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.51 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0938, -1.6016, 2.0156, -0.3223, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1562, -2.5938, -0.1924, 2.6094, -1.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1562, -4.5938, -1.5547, 1.3828, -3.2031]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2812, -3.1406, 0.2832, 1.9531, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6406, -3.8125, -2.1094, 1.7578, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:55,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.79 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 18:12:55,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.92 | bwd_microstep: 2.21 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1.00 | step_microstep: 3.13 [2025-11-06 18:12:55,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.46 | bwd: 2.92 | bwd_inner: 1.72 | bwd_allreduce: 1.04 | step: 3.21 33%|███▎ | 1158/3507 [28:09<58:33, 1.50s/it] {'loss': 0.5008, 'learning_rate': 1.5638469275578244e-05, 'epoch': 0.33} 33%|███▎ | 1158/3507 [28:09<58:33, 1.50s/it]tensor([[-5.2500, -3.9375, -0.1758, 0.9062, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:55,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.91 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.8125, -3.4219, 0.2178, 1.1484, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0938, -3.5938, 0.4531, -2.2188, -5.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3750, -4.4062, -2.3125, 1.6406, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3438, -0.9727, 3.2344, -1.6406, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5000, -3.7969, -0.5859, 2.2031, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:2') tensor([[-3.8438, -1.2266, 2.5312, 0.1914, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0312, -2.4375, 1.7266, -0.9648, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:12:56,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.21 | optimizer_step: 0.19 [2025-11-06 18:12:56,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.95 | bwd_microstep: 2.23 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 0.91 | step_microstep: 1.78 [2025-11-06 18:12:56,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.88 | bwd: 3.09 | bwd_inner: 2.00 | bwd_allreduce: 0.94 | step: 1.87 33%|███▎ | 1159/3507 [28:10<48:18, 1.23s/it] {'loss': 0.1352, 'learning_rate': 1.5630838007856214e-05, 'epoch': 0.33} 33%|███▎ | 1159/3507 [28:10<48:18, 1.23s/it]tensor([[-3.1562, -2.9219, -1.0703, 1.7812, -1.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5625, -3.8750, -0.7383, 2.1250, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.2812, -5.1875, -1.3281, 0.8711, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:56,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.93 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.2188, -5.8438, -2.5000, 1.2422, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3125, -1.9844, 1.7188, -0.2559, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3125, -1.1562, 0.8281, -1.7109, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1250, -3.8281, 
0.1436, 2.0312, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3125, -3.7656, 0.0938, 1.0859, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:12:58,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:12:58,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.78 | bwd_microstep: 1.94 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.28 [2025-11-06 18:12:59,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 288.73 | bwd: 2.88 | bwd_inner: 1.89 | bwd_allreduce: 0.87 | step: 2.36 33%|███▎ | 1160/3507 [28:12<1:04:53, 1.66s/it] {'loss': 0.3947, 'learning_rate': 1.5623201935509322e-05, 'epoch': 0.33} 33%|███▎ | 1160/3507 [28:12<1:04:53, 1.66s/it]tensor([[-5.9375, -4.1875, -0.1147, 0.0791, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -4.2812, -1.3984, 1.1484, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3125, -0.9023, 2.3906, 0.3770, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:12:59,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 206.59 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.5625, -2.2031, 2.0312, 0.3711, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4375, -2.9531, 1.6875, 0.1787, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2500, 0.2295, 2.6875, -0.6719, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6250, -1.6016, 2.1875, 1.9141, -2.5469]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9531, -1.0938, 2.8594, -0.8359, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:12:59,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:12:59,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.38 | bwd_microstep: 146.55 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 145.42 | step_microstep: 1.68 [2025-11-06 18:12:59,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.99 | bwd: 147.37 | bwd_inner: 1.80 | bwd_allreduce: 145.45 | step: 1.75 33%|███▎ | 1161/3507 [28:13<51:18, 1.31s/it] {'loss': 0.2785, 'learning_rate': 1.5615561065053208e-05, 'epoch': 0.33} 33%|███▎ | 1161/3507 [28:13<51:18, 1.31s/it]tensor([[-2.7969, -2.2812, 0.1973, 3.0625, -1.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2656, 0.2480, 2.4219, -1.0625, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5312, -3.9219, 0.6250, -1.4766, -5.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:12:59,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.02 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-9.1875, -7.9062, -2.8750, -0.3867, -6.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7656, -1.5547, 1.6719, -0.3164, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0625, -2.2500, 2.1875, -0.2373, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8594, -2.8125, -1.3828, 1.4531, -1.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:1')
tensor([[-2.8906, -2.4688, -0.6484, 1.5703, -1.5703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:00,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.71 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:13:00,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.62 | bwd_microstep: 2.08 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.57
[2025-11-06 18:13:00,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.66 | bwd: 3.03 | bwd_inner: 2.02 | bwd_allreduce: 0.89 | step: 2.65
33%|███▎ | 1162/3507 [28:14<50:09, 1.28s/it] {'loss': 0.126, 'learning_rate': 1.560791540300758e-05, 'epoch': 0.33}
tensor([[-3.4219, -3.0156, -0.7109, 2.0781, -1.8516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:00,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.16 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.2500, -2.7031, 0.8477, 1.5781, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.7500, -3.7969, -0.4004, 1.9922, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.8125, -3.7500, 1.1094, 1.1328, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.7344, -0.5859, 3.1719, -1.3750, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.7812, -4.0625, -0.3242, -0.1338, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.5312, -3.4531, -1.3594, 2.2969, -1.7422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.5938, -4.4375, -2.0469, 1.5156, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:02,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:13:02,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.91 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.94 | step_microstep: 2.16
[2025-11-06 18:13:02,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.07 | bwd: 2.88 | bwd_inner: 1.72 | bwd_allreduce: 0.98 | step: 2.24
33%|███▎ | 1163/3507 [28:15<51:31, 1.32s/it] {'loss': 0.3726, 'learning_rate': 1.5600264955896273e-05, 'epoch': 0.33}
tensor([[-4.2500, -1.6016, 2.6406, 0.4062, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.5312, -1.6250, 2.7031, -0.4883, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.7188, -3.2188, 0.3848, 1.4297, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:02,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.03 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.0312, -3.6250, 0.0815, 1.2109, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.7188, -2.9688, -0.2715, 1.7578, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.8750, -4.6875, -2.2812, 1.3359, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.5625, -1.2109, 3.3750, -1.1094, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.4062, -0.9844, 2.4688, 0.5664, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:13:03,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:13:03,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.56 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.82 | step_microstep: 1.82
[2025-11-06 18:13:03,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.61 | bwd: 2.76 | bwd_inner: 1.80 | bwd_allreduce: 0.85 | step: 1.90
33%|███▎ | 1164/3507 [28:17<51:49, 1.33s/it] {'loss': 0.147, 'learning_rate': 1.5592609730247167e-05, 'epoch': 0.33}
tensor([[-3.4844, -2.7969, 0.0056, 2.8281, -1.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:1')
tensor([[-3.5000, -2.4531, 0.4629, 2.2500, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:03,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.87 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.4062, -3.6250, -0.2393, 2.7969, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.9219, -2.1562, 0.7227, 3.5156, -1.4141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.6875, -2.7344, 1.2891, 1.0234, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-7.0938, -5.3125, -0.2207, 0.9453, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.2969, -2.4219, 0.5547, 2.6250, -1.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.9688, -2.2656, 2.3906, 0.0107, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:05,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:13:05,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.95 | bwd_microstep: 1.86 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.72 | step_microstep: 2.09
[2025-11-06 18:13:05,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 488.85 | bwd: 2.80 | bwd_inner: 1.93 | bwd_allreduce: 0.76 | step: 2.18
33%|███▎ | 1165/3507 [28:19<58:22, 1.50s/it] {'loss': 1.1037, 'learning_rate': 1.558494973259224e-05, 'epoch': 0.33}
tensor([[-5.0312, -2.4375, 2.2656, 0.3906, -3.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-7.6875, -5.6250, -2.6562, -4.4688, -6.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
tensor([[-6.3125, -2.9688, 2.1875, -1.9297, -5.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-1.4766, 1.2500, 3.1406, -0.7852, -1.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
[2025-11-06 18:13:05,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.55 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.9062, -2.5156, 1.5391, 0.1719, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.6562, -3.7500, 0.1826, -0.1328, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.0469, -2.7344, -0.0771, 3.8906, -1.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.9688, -2.8281, 0.4824, 2.1562, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:13:05,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:13:05,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.63 | bwd_microstep: 61.39 | bwd_inner_microstep: 1.30 | bwd_allreduce_microstep: 60.00 | step_microstep: 1.51
[2025-11-06 18:13:05,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.21 | bwd: 62.28 | bwd_inner: 2.10 | bwd_allreduce: 60.04 | step: 1.59
33%|███▎ | 1166/3507 [28:19<45:38, 1.17s/it] {'loss': 0.9768, 'learning_rate': 1.5577284969467545e-05, 'epoch': 0.33}
tensor([[-6.3750, -6.1250, -3.0000, 1.1250, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.7812, -2.2188, 2.1562, 0.3633, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.8125, -3.1094, 1.9609, 0.2559, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.5312, -3.3594, 0.1787, 2.1562, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.4375, -4.6875, -1.0391, 2.1094, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.9062, 0.4941, 3.5781, -2.2656, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:13:06,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.34 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-1.9141, 0.7344, 2.4375, -1.6719, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
[h264 @ 0xd13b1c0] mmco: unref short failure
tensor([[-4.6875, -4.0000, -0.8516, 1.9766, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:08,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.13 | optimizer_step: 0.17
[2025-11-06 18:13:08,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.88 | bwd_microstep: 1.92 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.78 | step_microstep: 2.28
[2025-11-06 18:13:08,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.22 | bwd: 2.84 | bwd_inner: 1.90 | bwd_allreduce: 0.82 | step: 2.37
33%|███▎ | 1167/3507 [28:22<1:02:05, 1.59s/it] {'loss': 0.3222, 'learning_rate': 1.5569615447413186e-05, 'epoch': 0.33}
tensor([[-4.3750, -2.6875, 0.8945, 1.1016, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.7344, -2.3281, 0.5625, 0.9258, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:08,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.75 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-5.2188, -3.4844, 0.7578, 1.5000, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.1406, -1.8984, 0.4297, 0.5273, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.4062, -3.7031, 0.0640, 0.0845, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.5000, -2.6250, 0.2402, 2.4688, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.8594, -3.6562, -1.1953, 2.5469, -2.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.6094, -3.4531, -0.8398, 3.5469, -1.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:13:08,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.14 | optimizer_step: 0.18
[2025-11-06 18:13:08,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.21 | bwd_microstep: 147.22 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 146.17 | step_microstep: 1.43
[2025-11-06 18:13:08,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.99 | bwd: 148.10 | bwd_inner: 1.75 | bwd_allreduce: 146.21 | step: 1.52
33%|███▎ | 1168/3507 [28:22<50:11, 1.29s/it] {'loss': 0.4005, 'learning_rate': 1.5561941172973336e-05, 'epoch': 0.33}
tensor([[-4.1250, -0.8281, 3.4688, -0.9258, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.8750, -3.2344, 0.4746, 0.8320, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.1250, -4.4375, -0.9023, 2.7344, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:09,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.89 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-7.3125, -6.5000, -2.8750, -0.2930, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.3750, -2.1875, 1.8047, 0.8086, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.0000, -3.4531, -0.8008, 2.2031, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:2')
tensor([[-4.9375, -3.5625, 0.1328, 1.4453, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.9688, -0.7930, 3.4844, -0.8242, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:13:10,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 18:13:10,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.44 | bwd_microstep: 1.74 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.73 | step_microstep: 2.10
[2025-11-06 18:13:10,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.35 | bwd: 2.62 | bwd_inner: 1.73 | bwd_allreduce: 0.77 | step: 2.20
33%|███▎ | 1169/3507 [28:24<53:57, 1.38s/it] {'loss': 0.7296, 'learning_rate': 1.555426215269623e-05, 'epoch': 0.33}
tensor([[-4.0625, -0.9766, 3.2812, -0.3828, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.4375, -3.7656, -0.6562, 2.2188, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.5938, -3.5938, -1.3672, 2.7656, -1.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:10,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.90 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.4688, -0.6602, 2.1406, -2.2344, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.2188, -2.6406, 1.0703, 1.6094, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.4062, -3.9844, -1.3281, 1.9219, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.9531, -1.6641, 2.1719, 0.5234, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.5312, -4.3438, -0.4902, 1.8594, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:13:10,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.15 | optimizer_step: 0.20
[2025-11-06 18:13:10,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.78 | bwd_microstep: 57.35 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 56.14 | step_microstep: 1.82
[2025-11-06 18:13:10,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 293.70 | bwd: 58.06 | bwd_inner: 1.75 | bwd_allreduce: 56.17 | step: 1.89
33%|███▎ | 1170/3507 [28:24<42:13, 1.08s/it] {'loss': 0.1303, 'learning_rate': 1.554657839313413e-05, 'epoch': 0.33}
tensor([[-1.1953, 1.2422, 2.7031, -1.0938, -1.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.0156, -2.1250, -0.8789, 2.5781, -0.5742]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.2500, -3.9219, -0.2109, 1.2734, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:11,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.54 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.2188, -2.5781, 1.7734, -0.4160, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-1.9844, 0.5273, 2.5000, -1.2812, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.4375e+00, -3.0938e+00, 1.4609e+00, 1.9989e-03, -4.2500e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.1875, -0.9492, 3.1250, -1.6328, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.7812, -2.5000, 1.4531, -0.0635, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:12,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:13:12,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.50 | bwd_microstep: 1.82 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.24
[2025-11-06 18:13:12,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.06 | bwd: 2.60 | bwd_inner: 1.65 | bwd_allreduce: 0.82 | step: 2.31
33%|███▎ | 1171/3507 [28:26<47:49, 1.23s/it] {'loss': 0.3538, 'learning_rate': 1.553888990084338e-05, 'epoch': 0.33}
tensor([[-4.5938, -4.6875, -2.8281, 0.8438, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.8750, -3.5938, -1.5234, 1.2031, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:12,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.91 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.2500, -1.6172, 1.7812, -1.2109, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.3906, -1.0312, 2.2656, 0.2080, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.3125, -4.3125, -0.5742, 2.1562, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.9688, -4.7188, -0.3633, 2.1406, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.1562, -2.5625, 1.0234, 1.6172, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.5000, -4.3750, -1.9297, 1.9688, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:13:12,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.25 | optimizer_step: 0.21
[2025-11-06 18:13:12,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.34 | bwd_microstep: 27.98 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 26.78 | step_microstep: 1.97
[2025-11-06 18:13:12,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.26 | bwd: 28.82 | bwd_inner: 1.85 | bwd_allreduce: 26.82 | step: 2.07
33%|███▎ | 1172/3507 [28:26<37:54, 1.03it/s] {'loss': 0.4644, 'learning_rate': 1.553119668238432e-05, 'epoch': 0.33}
tensor([[-2.8438, -0.1953, 2.6562, -0.4844, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.4688, -2.6406, 1.1250, 1.2734, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:13:13,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.23 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.0625, -0.8750, 2.5781, -2.3125, -3.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.8438, -4.2188, -1.1250, 1.9375, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.2500, -1.4531, 2.7812, -0.0510, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.0000, -3.9844, 0.7031, 0.7148, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.1250, -0.3945, 2.9219, -0.5508, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.7500, -2.7031, 1.6328, 1.2109, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:15,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:13:15,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.85 | bwd_microstep: 1.88 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.83 | step_microstep: 86.72
[2025-11-06 18:13:15,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.10 | bwd: 2.75 | bwd_inner: 1.76 | bwd_allreduce: 0.85 | step: 86.79
33%|███▎ | 1173/3507 [28:29<58:37, 1.51s/it] {'loss': 0.3474, 'learning_rate': 1.5523498744321352e-05, 'epoch': 0.33}
tensor([[-3.2344, -0.6250, 3.0469, 0.8281, -2.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-1.8672, -2.0312, -0.3379, 4.0000, -0.2871]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.2812, -1.8203, 1.0625, -1.4375, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:13:15,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.98 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.4688, -1.2188, 2.1406, 0.2002, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.4375, -1.3984, 3.0312, -0.5664, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.5625, -3.4531, 1.1406, 0.9570, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.8750, -3.6406, 0.1953, 2.4688, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.2656, -0.4238, 2.7031, -1.0391, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:13:16,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:13:16,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.33 | bwd_microstep: 230.75 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 229.57 | step_microstep: 1.59
[2025-11-06 18:13:16,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.34 | bwd: 231.49 | bwd_inner: 1.77 | bwd_allreduce: 229.60 | step: 1.67
33%|███▎ | 1174/3507 [28:30<48:15, 1.24s/it] {'loss': 0.1631, 'learning_rate': 1.551579609322289e-05, 'epoch': 0.33}
tensor([[-2.7656, -2.6562, -0.5078, 3.4062, -1.1016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.5938, -2.5469, -0.6523, 3.0312, -1.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.7188, -4.2188, -1.3203, 2.0781, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:16,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.91 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.4062, -3.0938, 0.4180, 1.8125, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.5000, -2.5469, -1.5156, 1.2422, -1.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.8125, -5.2812, -1.9688, 1.4531, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.1250, -2.1250, 1.4766, 0.9102, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.1562, -1.6562, 2.2188, 0.0889, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:13:18,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:13:18,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.68 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.17
[2025-11-06 18:13:18,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.61 | bwd: 2.64 | bwd_inner: 1.68 | bwd_allreduce: 0.85 | step: 2.24
34%|███▎ | 1175/3507 [28:32<1:00:31, 1.56s/it] {'loss': 0.2207, 'learning_rate': 1.5508088735661378e-05, 'epoch': 0.34}
tensor([[-3.2188, -0.6172, 2.4531, -0.1289, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.2969, -1.1094, 1.8047, 0.0781, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.4375, -5.0625, -2.2031, 1.2578, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:18,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.59 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.4844, -3.2969, -0.7148, 3.5469, -1.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.7812, -2.8594, 1.2734, 1.4531, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.9844, -1.8438, 1.8906, 0.8359, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.9688, -2.6562, 0.4863, 1.6641, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.7969, -3.7500, -1.9062, 1.5156, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:13:18,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 18:13:18,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.97 | bwd_microstep: 48.18 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 47.01 | step_microstep: 1.64
[2025-11-06 18:13:18,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.58 | bwd: 49.19 | bwd_inner: 2.02 | bwd_allreduce: 47.05 | step: 1.73
34%|███▎ | 1176/3507 [28:32<47:28, 1.22s/it] {'loss': 0.3619, 'learning_rate': 1.550037667821327e-05, 'epoch': 0.34}
tensor([[-3.7656, -1.2344, 2.5000, 0.3672, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.1875, -0.1719, 2.2812, -2.5781, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.3125, 0.3770, 2.8906, -0.6289, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.3438, -3.7656, -0.8477, 2.0469, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:13:19,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.42 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-0.0444, 0.3945, 2.1875, 5.0000, 0.9023]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.1875, -3.4219, -0.0981, 2.9375, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.0938e+00, -1.4141e+00, 2.6719e+00, -1.4420e-03, -3.4219e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-1.8047, 0.9141, 4.3125, 1.2578, -1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:20,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.13 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:13:20,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.63 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.93 | step_microstep: 3.20
[2025-11-06 18:13:20,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 447.06 | bwd: 2.97 | bwd_inner: 1.89 | bwd_allreduce: 0.96 | step: 3.27
34%|███▎ | 1177/3507 [28:34<51:11, 1.32s/it] {'loss': 0.4734, 'learning_rate': 1.5492659927459033e-05, 'epoch': 0.34}
tensor([[-5.3438, -4.9688, -1.7266, 2.5156, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.9531, -3.8594, -1.4297, 2.9219, -1.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.6250, -3.7969, 0.0215, -3.3750, -5.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
[2025-11-06 18:13:20,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.70 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-6.5312, -5.7188, -2.0156, 0.7070, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.2812, -2.2500, 2.2344, -0.9414, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-5.0312, -4.2500, -1.1641, 1.3984, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.7969, -0.0991, 2.3594, -1.7578, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.4688, 0.6094, 3.1875, -1.7031, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
[2025-11-06 18:13:21,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.77 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:13:21,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.07 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.91 | step_microstep: 2.55
[2025-11-06 18:13:21,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 295.80 | bwd: 2.94 | bwd_inner: 1.83 | bwd_allreduce: 0.96 | step: 2.65
34%|███▎ | 1178/3507 [28:35<46:02, 1.19s/it] {'loss': 1.4234, 'learning_rate': 1.5484938489983144e-05, 'epoch': 0.34}
tensor([[-4.5938, -1.8125, 1.6406, -1.7969, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.8281, -3.0000, -0.1855, 1.9922, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:21,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.36 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-6.8438, -4.2812, 0.1309, -1.8516, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.7812, -4.0000, -0.6875, 2.2344, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.8750, -2.9531, 0.0977, 2.3438, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.8438, -3.8594, -0.6094, 1.2109, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.6250, -2.1250, 2.3281, 1.0000, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.0938, -1.9062, 0.9844, 1.9609, -1.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:13:26,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.19 | optimizer_step: 0.28
[2025-11-06 18:13:26,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.50 | bwd_microstep: 3355.34 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 3354.15 | step_microstep: 2.27
[2025-11-06 18:13:26,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.89 | bwd: 3356.18 | bwd_inner: 1.84 | bwd_allreduce: 3354.21 | step: 2.35
34%|███▎ | 1179/3507 [28:40<1:32:23, 2.38s/it] {'loss': 0.6073, 'learning_rate': 1.547721237237407e-05, 'epoch': 0.34}
tensor([[-3.8750, -2.1719, 1.1719, 1.2812, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.8438, -0.6602, 2.2344, 0.4941, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.6562, -0.9844, 2.3750, -0.7109, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.3750, -5.0000, -1.7344, 2.5000, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:26,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.63 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.06
tensor([[-4.0938, -1.4688, 2.5000, 0.1055, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.3438, -4.5000, -1.0859, 1.7500, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.2500, -4.0312, -0.5117, 1.1797, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.5625, -5.0000, -1.8203, 1.5859, -3.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:26,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 18:13:26,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.34 | bwd_microstep: 1.63 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.70 | step_microstep: 1.79
[2025-11-06 18:13:26,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.99 | bwd: 2.57 | bwd_inner: 1.72 | bwd_allreduce: 0.72 | step: 1.85
34%|███▎ | 1180/3507 [28:40<1:08:52, 1.78s/it] {'loss': 0.1874, 'learning_rate': 1.5469481581224274e-05, 'epoch': 0.34}
tensor([[-3.9375, -0.9219, 3.2969, 0.1465, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.7500, -3.0312, 0.5000, 0.6055, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:13:27,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.35 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-7.8438, -5.4062, 0.1118, -0.4688, -5.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.3750, -1.9766, 1.9844, 0.4121, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.4688, -4.0938, -1.3906, 2.0312, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.4375, -4.1875, -0.5195, 0.8438, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-1.6094, 1.1875, 3.0625, -1.6719, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.8125, -4.1875, -1.1953, 1.8750, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:13:28,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:13:28,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.63 | bwd_microstep: 1130.12 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 1129.13 | step_microstep: 2.07
[2025-11-06 18:13:28,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.01 | bwd: 1131.01 | bwd_inner: 1.73 | bwd_allreduce: 1129.16 | step: 2.15
34%|███▎ | 1181/3507 [28:42<1:05:35, 1.69s/it] {'loss': 0.2367, 'learning_rate': 1.5461746123130202e-05, 'epoch': 0.34}
tensor([[-5.1875, -3.3438, 0.9570, 1.4141, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:13:28,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.57 | bwd_microstep: 1.27 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.25
tensor([[-3.8281, -2.8906, 0.2891, 3.0312, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.1875, -3.8125, -1.3125, 1.7891, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.7500, -1.8906, 2.1094, -0.9961, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.5938, -3.2812, 0.6914, -0.8398, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.1562, -1.8125, 1.5859, 3.4062, -1.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.7812, -3.8906, -0.6211, 2.0625, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.6094, -1.6875, 1.8047, 1.5000, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:13:28,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:13:28,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.47 | bwd_microstep: 100.30 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 99.28 | step_microstep: 1.49
[2025-11-06 18:13:28,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.09 | bwd: 101.57 | bwd_inner: 2.04 | bwd_allreduce: 99.35 | step: 1.73
34%|███▎ | 1182/3507 [28:42<51:53, 1.34s/it] {'loss': 0.3117, 'learning_rate': 1.54540060046923e-05, 'epoch': 0.34}
tensor([[-5.0625, -3.8750, -0.1079, 1.7969, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:13:29,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.91 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.1562, -3.7500, 1.0312, 0.0364, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.6562, -1.9688, 2.5312, 0.2021, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.8594, -1.1016, 3.0938, 0.6094, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.2188, -2.0781, 1.3516, -3.5312, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.5625, -3.8438, -0.6172, 2.5000, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.8438, -3.7500, -1.5156, 2.6094, -1.8672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.7188, -0.8281, 2.6094, -0.7773, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:13:30,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:13:30,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.24 | bwd_microstep: 1178.76 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1177.47 | step_microstep: 1.58
[2025-11-06 18:13:30,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.17 | bwd: 1179.62 | bwd_inner: 1.99 | bwd_allreduce: 1177.50 | step: 1.65
34%|███▎ | 1183/3507 [28:44<54:17, 1.40s/it]
{'loss': 0.3457, 'learning_rate': 1.544626123251497e-05, 'epoch': 0.34} 34%|███▎ | 1183/3507 [28:44<54:17, 1.40s/it]tensor([[-2.8438, -0.6523, 2.0938, -0.3848, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:13:30,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.31 | bwd_microstep: 1.28 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.5312, -3.2812, 0.3555, 2.2812, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3750, -1.9297, 1.9922, 0.3223, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1875, -4.0625, 0.7266, 0.4004, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2969, -0.8867, 2.1875, -0.2754, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6406, -2.6094, -0.1406, 1.0938, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7500, -1.8438, 2.0312, -1.5938, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2188, -3.2812, -1.1406, 3.4219, -1.2891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:13:30,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:13:30,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.59 | bwd_microstep: 48.40 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 46.97 | step_microstep: 1.83 [2025-11-06 18:13:30,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.92 | bwd: 49.67 | bwd_inner: 2.52 | bwd_allreduce: 47.01 | step: 1.90 34%|███▍ | 1184/3507 [28:44<42:44, 1.10s/it] {'loss': 0.1941, 'learning_rate': 
1.5438511813206596e-05, 'epoch': 0.34} 34%|███▍ | 1184/3507 [28:44<42:44, 1.10s/it]tensor([[-4.2188, -2.7500, 0.8555, 1.8828, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:31,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.15 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.7500, -0.1758, 2.8281, -0.4512, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9844, 0.4824, 4.0312, -1.6406, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5000, -4.0938, 0.1924, 2.0000, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.4219, -0.2451, 2.9531, 1.0625, -1.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6875, -4.5000, -0.5508, 1.5156, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -1.8359, 2.4219, -1.1719, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3750, -3.1250, 0.2334, 1.7656, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:13:33,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.25 | optimizer_step: 0.22 [2025-11-06 18:13:33,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.61 | bwd_microstep: 2560.24 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 2559.29 | step_microstep: 2.60 [2025-11-06 18:13:33,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 295.79 | bwd: 2561.10 | bwd_inner: 1.59 | bwd_allreduce: 2559.34 | step: 2.69 34%|███▍ | 1185/3507 [28:47<1:03:30, 1.64s/it] {'loss': 0.3828, 'learning_rate': 1.5430757753379527e-05, 'epoch': 
0.34} 34%|███▍ | 1185/3507 [28:47<1:03:30, 1.64s/it]tensor([[-6.7188, -4.2812, 0.8789, 0.0476, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([[-5.5938, -3.4062, 0.3203, -1.2344, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([2], device='cuda:0') [2025-11-06 18:13:33,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.15 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.9688, -4.6250, -0.9766, 0.2354, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.6406, 0.2480, 2.5625, 0.9258, -1.3047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8438, -2.6250, 1.6406, 0.9609, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.9922, -2.2656, -1.2031, 2.7031, -0.4473]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-3.0312, -1.3281, 2.0469, 2.6562, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5000, -3.5156, -0.0134, 2.6562, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:13:34,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:13:34,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.85 | bwd_microstep: 128.30 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 127.14 | step_microstep: 1.92 [2025-11-06 18:13:34,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.03 | bwd: 129.25 | bwd_inner: 1.94 | bwd_allreduce: 127.18 | step: 2.00 34%|███▍ | 1186/3507 [28:48<50:14, 1.30s/it] {'loss': 0.7411, 'learning_rate': 1.5422999059650064e-05, 'epoch': 0.34} 34%|███▍ | 1186/3507 [28:48<50:14, 
1.30s/it]tensor([[-2.9219, -0.3340, 2.4688, -0.6719, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6719, -3.3438, -0.9062, 2.5312, -1.8828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:34,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.55 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0938, -2.1719, 1.7109, 1.5859, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0312, -2.1562, 1.3828, 1.2188, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4062, -2.3750, 1.3516, 0.6133, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8438, -1.7031, 2.7969, -0.8828, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3750, -2.6250, 1.1641, 1.7031, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7031, -3.3438, -1.1641, 1.6719, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') [2025-11-06 18:13:36,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.20 | optimizer_step: 0.32 [2025-11-06 18:13:36,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.34 | bwd_microstep: 2155.85 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 2154.49 | step_microstep: 2.32 [2025-11-06 18:13:36,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.92 | bwd: 2156.70 | bwd_inner: 2.04 | bwd_allreduce: 2154.53 | step: 2.40 34%|███▍ | 1187/3507 [28:50<1:04:08, 1.66s/it] {'loss': 0.8093, 'learning_rate': 1.541523573863847e-05, 'epoch': 0.34} 34%|███▍ | 1187/3507 [28:50<1:04:08, 1.66s/it]tensor([[-4.5000, 
-3.0156, 0.6953, 1.8359, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:36,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.49 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.1406, -0.1328, 3.5312, 3.1875, -1.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8594, -3.3750, -0.6641, 2.4219, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3750, 0.4414, 4.0938, 1.0312, -2.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4688, -2.3125, -0.3594, 3.3594, -0.8516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0625, -0.6992, 1.9375, -0.7070, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8281, -0.3223, 3.6250, -1.6562, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9531, -4.0625, -2.3125, 1.4531, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:13:37,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.18 [2025-11-06 18:13:37,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.32 | bwd_microstep: 54.40 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 53.35 | step_microstep: 1.62 [2025-11-06 18:13:37,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 408.83 | bwd: 55.35 | bwd_inner: 1.83 | bwd_allreduce: 53.39 | step: 1.71 34%|███▍ | 1188/3507 [28:51<50:44, 1.31s/it] {'loss': 0.2005, 'learning_rate': 1.5407467796968957e-05, 'epoch': 0.34} 34%|███▍ | 1188/3507 [28:51<50:44, 1.31s/it]tensor([[-4.6875, -2.9531, 0.5156, 0.5898, -3.4219]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6562, -4.3438, -1.2188, 2.7969, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0000, -4.0000, -0.4688, 2.0312, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:37,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.53 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.8125, -3.4844, -1.0781, 2.3750, -2.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -3.7656, -0.4648, 1.9922, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5000, -0.0530, 3.6094, -2.0469, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9688, -1.3438, 2.5156, -0.0708, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0312, -1.1484, 1.6875, -2.3750, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:13:37,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:13:37,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.24 | bwd_microstep: 115.64 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 114.50 | step_microstep: 2.03 [2025-11-06 18:13:37,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 506.79 | bwd: 116.53 | bwd_inner: 1.87 | bwd_allreduce: 114.54 | step: 2.10 34%|███▍ | 1189/3507 [28:51<43:11, 1.12s/it] {'loss': 0.1428, 'learning_rate': 1.539969524126967e-05, 'epoch': 0.34} 34%|███▍ | 1189/3507 [28:51<43:11, 1.12s/it]tensor([[-5.0625, -3.6875, -0.4336, 0.2012, -3.5938]], device='cuda:2', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2188, -0.8633, 3.2031, -1.8906, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -2.7500, 1.2344, 0.8750, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4062, -2.4375, 1.1250, 0.3027, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -1.8594, 1.6406, -0.1240, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:13:38,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.78 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-6.2812, -4.5312, -0.2383, 0.3242, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.3906, 0.4199, 3.6250, 0.3809, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7969, -3.7031, -1.5938, 2.1094, -1.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:13:39,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:13:39,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.01 | bwd_microstep: 245.96 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 244.97 | step_microstep: 2.10 [2025-11-06 18:13:39,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.75 | bwd: 246.63 | bwd_inner: 1.51 | bwd_allreduce: 245.00 | step: 2.17 34%|███▍ | 1190/3507 [28:52<42:24, 1.10s/it] {'loss': 0.503, 'learning_rate': 1.5391918078172698e-05, 'epoch': 0.34} 34%|███▍ | 1190/3507 [28:52<42:24, 1.10s/it]tensor([[-4.8750, -4.0938, -0.5977, 2.3750, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-4.0625, -3.0625, 0.1484, 2.2656, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-8.6250, -7.0000, -1.8906, -0.0630, -6.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:39,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.09 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.5781, -2.6875, -1.0312, 3.0469, -0.8984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4688, -2.7656, 1.3672, -1.1172, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0625, -2.1875, 1.1641, 0.6719, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.7031, -1.4609, 0.0243, 3.1094, -0.3613]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.5938, -4.1875, -1.3047, 2.1406, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:13:40,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 18:13:40,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.61 | bwd_microstep: 929.67 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 928.63 | step_microstep: 1.87 [2025-11-06 18:13:40,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.74 | bwd: 930.37 | bwd_inner: 1.59 | bwd_allreduce: 928.66 | step: 1.94 34%|███▍ | 1191/3507 [28:54<45:16, 1.17s/it] {'loss': 0.6351, 'learning_rate': 1.5384136314314065e-05, 'epoch': 0.34} 34%|███▍ | 1191/3507 [28:54<45:16, 1.17s/it]tensor([[-4.0938, -2.9688, 0.4570, 2.6250, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4062, -0.6797, 2.2188, 
2.2344, -1.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:40,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.46 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.2969, -3.5312, -2.1562, 1.5859, -1.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-7.6250, -6.7188, -2.3750, 0.9531, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0625, -3.3281, -2.0469, 1.6328, -1.3672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.5938, -5.2500, -1.3516, -0.0383, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -2.6406, 2.0781, 0.9336, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0312, -1.3359, 1.2969, 0.8672, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:13:41,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.43 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:13:41,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.76 | bwd_microstep: 675.82 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 674.81 | step_microstep: 11.74 [2025-11-06 18:13:41,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.25 | bwd: 676.81 | bwd_inner: 1.80 | bwd_allreduce: 674.86 | step: 11.83 34%|███▍ | 1192/3507 [28:55<44:42, 1.16s/it] {'loss': 0.6601, 'learning_rate': 1.537634995633371e-05, 'epoch': 0.34} 34%|███▍ | 1192/3507 [28:55<44:42, 1.16s/it]tensor([[-4.0000, -1.4766, 1.6562, -1.0625, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:13:41,686] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 190.88 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.4375, -0.6406, 2.4219, -1.3438, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -2.2812, 1.4766, 1.3125, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7188, -4.2812, 0.0635, 1.8594, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5625, -2.4219, 1.4922, 0.7266, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -2.3594, 2.0469, -0.1992, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9531, -2.9062, 0.3086, 2.4062, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8125, -4.2188, 0.4043, 1.9297, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:13:43,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:13:43,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.32 | bwd_microstep: 1980.59 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1979.55 | step_microstep: 2.27 [2025-11-06 18:13:43,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.22 | bwd: 1981.68 | bwd_inner: 1.94 | bwd_allreduce: 1979.60 | step: 2.36 34%|███▍ | 1193/3507 [28:57<58:25, 1.52s/it] {'loss': 0.3239, 'learning_rate': 1.536855901087551e-05, 'epoch': 0.34} 34%|███▍ | 1193/3507 [28:57<58:25, 1.52s/it]tensor([[ 0.3105, 3.0469, 4.6875, 0.3809, -0.3047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.1250, -0.2578, 2.1562, -2.6250, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:1') tensor([[-3.7344, -3.1562, -0.2021, 3.0469, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:44,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.33 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.1875, -3.5781, -0.7617, 2.1875, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8438, -1.0625, 2.1719, 1.7500, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4688, -4.8750, -0.6211, 0.3105, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1875, -3.6875, 0.4258, 1.8750, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1250, -2.5938, 1.7031, -0.2891, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:13:44,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:13:44,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.72 | bwd_microstep: 314.06 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 312.99 | step_microstep: 1.94 [2025-11-06 18:13:44,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.08 | bwd: 315.05 | bwd_inner: 1.90 | bwd_allreduce: 313.03 | step: 2.02 34%|███▍ | 1194/3507 [28:58<49:32, 1.29s/it] {'loss': 0.5818, 'learning_rate': 1.536076348458723e-05, 'epoch': 0.34} 34%|███▍ | 1194/3507 [28:58<49:32, 1.29s/it]tensor([[-6.0625, -4.5312, -0.6094, 0.1250, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3594, 0.4297, 3.5156, -0.1147, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 
18:13:44,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.00 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.7812, -2.8438, 2.3281, -0.0513, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1875, -3.7812, -0.8320, 2.9219, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.3770, 1.6797, 2.7969, 0.5859, -0.4863]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.3438, -3.4062, 0.9766, 1.4219, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1250, -2.3906, 2.2031, -0.2852, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [h264 @ 0xc185ac0] mmco: unref short failure tensor([[-4.5000, -4.1562, -1.3203, 2.3438, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:13:45,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:13:45,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.83 | bwd_microstep: 287.71 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 286.43 | step_microstep: 1.40 [2025-11-06 18:13:45,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.84 | bwd: 288.70 | bwd_inner: 2.10 | bwd_allreduce: 286.47 | step: 1.49 34%|███▍ | 1195/3507 [28:59<41:56, 1.09s/it] {'loss': 0.6572, 'learning_rate': 1.5352963384120567e-05, 'epoch': 0.34} 34%|███▍ | 1195/3507 [28:59<41:56, 1.09s/it]tensor([[-1.5938, 1.2188, 3.3281, -0.6641, -1.7734]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7031, -2.3125, 1.2656, 2.6406, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-3.5469, -1.1406, 2.2656, -0.1299, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6562, -1.7344, 2.6094, -0.9023, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:13:45,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.86 | bwd_microstep: 1.32 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.1562, -1.6641, 1.7500, 2.7969, -1.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.4375, -5.3438, -0.1631, 0.2715, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6562, -4.3750, -0.5820, 1.1797, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.9062, -1.7578, 2.5000, -1.4141, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:13:47,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:13:47,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.40 | bwd_microstep: 1338.11 | bwd_inner_microstep: 1.49 | bwd_allreduce_microstep: 1336.53 | step_microstep: 1.83 [2025-11-06 18:13:47,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.24 | bwd: 1339.43 | bwd_inner: 2.71 | bwd_allreduce: 1336.57 | step: 1.92 34%|███▍ | 1196/3507 [29:01<55:29, 1.44s/it] {'loss': 0.197, 'learning_rate': 1.534515871613111e-05, 'epoch': 0.34} 34%|███▍ | 1196/3507 [29:01<55:29, 1.44s/it]tensor([[-4.2500, -3.1562, -0.2041, 0.9688, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7500, -1.0781, 2.0781, -0.8945, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2969, -3.1406, -1.0703, 
2.2969, -1.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -1.5391, 1.0469, -1.9531, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:13:47,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.27 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-5.2500, -2.4219, 2.0000, -0.8203, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5625, -3.7500, -0.3379, 2.7031, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.3125, -0.7227, 2.2500, -0.5312, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9219, -3.5312, -0.7148, 3.1094, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:13:47,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 18:13:47,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.12 | bwd_microstep: 47.27 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 46.08 | step_microstep: 2.30 [2025-11-06 18:13:47,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.42 | bwd: 48.23 | bwd_inner: 1.91 | bwd_allreduce: 46.14 | step: 2.41 34%|███▍ | 1197/3507 [29:01<43:52, 1.14s/it] {'loss': 0.1078, 'learning_rate': 1.5337349487278346e-05, 'epoch': 0.34} 34%|███▍ | 1197/3507 [29:01<43:52, 1.14s/it]tensor([[-3.7812, -1.4453, 1.9297, -0.2637, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7969, -2.7500, 0.5664, 2.7344, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:48,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 178.60 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.7188, -2.2969, 2.2500, 1.1953, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9219, -2.1406, 1.2656, 1.2266, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8906, -3.9062, -2.0156, 1.7031, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -2.9375, 1.4219, 2.2031, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0312, -3.2344, 1.1484, 2.0469, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4844, -0.3594, 3.2344, -0.8789, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:13:48,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.15 | optimizer_step: 0.19 [2025-11-06 18:13:48,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.65 | bwd_microstep: 387.13 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 386.15 | step_microstep: 2.09 [2025-11-06 18:13:48,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.26 | bwd: 387.99 | bwd_inner: 1.66 | bwd_allreduce: 386.19 | step: 2.18 34%|███▍ | 1198/3507 [29:02<39:08, 1.02s/it] {'loss': 0.5086, 'learning_rate': 1.532953570422566e-05, 'epoch': 0.34} 34%|███▍ | 1198/3507 [29:02<39:08, 1.02s/it]tensor([[-3.0938, -0.8203, 2.1406, 0.3223, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:13:48,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 99.23 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.0312, -3.2969, 1.6172, -0.7344, 
-4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1875, -4.3750, -0.8867, 2.2188, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5000, -1.3203, 2.2500, 1.2656, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.2344, -0.3652, 2.8125, -0.9883, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.5312, -5.0938, -1.1016, -0.2061, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0625, -1.7734, 1.6797, 0.1201, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[18:13:50] /github/workspace/src/video/video_reader.cc:83: ERROR opening: /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch20/Pilgrvideo_to_Beethoven.mp4, No such file or directory
Warning: The cache directory for DeepSpeed Triton autotune, /root/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
Error reading /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch20/Pilgrvideo_to_Beethoven.mp4...
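The NFS warning repeated in the log above can be addressed exactly as the warning itself suggests: point the Triton autotune cache at a local, non-NFS path before launching training. A minimal shell sketch (the /tmp path here is an arbitrary local choice, not from the original run):

```shell
# Redirect DeepSpeed's Triton autotune cache off NFS to avoid
# slowdowns or hangs when DeepSpeed exits; any local path works.
export TRITON_CACHE_DIR=/tmp/triton_cache
mkdir -p "$TRITON_CACHE_DIR"
echo "Triton cache at: $TRITON_CACHE_DIR"
```

Exporting the variable in the launch script (before `torchrun`/`deepspeed`) ensures every spawned rank inherits it.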
sharegpt4v_instruct_gpt4-vision_cap100k
Traceback (most recent call last):
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 718, in __getitem__
    ret=self.video_get_item(data_item)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 604, in video_get_item
    image_list,frame_indices = self.load_video(video_path)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 582, in load_video
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/miniconda3/envs/visualquality/lib/python3.11/site-packages/decord/video_reader.py", line 57, in __init__
    raise RuntimeError("Error reading " + uri + "...")
RuntimeError: Error reading /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch20/Pilgrvideo_to_Beethoven.mp4...
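The traceback shows decord's `VideoReader` raising `RuntimeError` inside the dataset's `__getitem__` because a video file is missing, which kills the dataloader worker. A common mitigation (not what the training script above actually does) is to catch the load error and deterministically fall back to another sample; a minimal sketch with hypothetical names:

```python
class RobustVideoDataset:
    """Sketch of a dataset __getitem__ that survives unreadable videos.
    All names here are hypothetical, not from internvl_chat_finetune_dist.py."""

    def __init__(self, items, loader, max_retries=10):
        self.items = items            # list of data records / video paths
        self.loader = loader          # callable that may raise RuntimeError
        self.max_retries = max_retries

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        for _ in range(self.max_retries):
            try:
                return self.loader(self.items[idx])
            except (RuntimeError, FileNotFoundError):
                # decord raises RuntimeError("Error reading <uri>...") when a
                # file is missing; try the next sample instead of crashing.
                idx = (idx + 1) % len(self.items)
        raise RuntimeError("too many unreadable samples in a row")
```

The retry bound keeps a systematically broken dataset from looping forever; a real run would also log the skipped path so missing files (like the LSVQ clip above) can be pruned from the annotation list.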
tensor([[-5.1562, -5.0625, -2.2969, 2.2656, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:13:50,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:13:50,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.15 | bwd_microstep: 1290.68 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 1289.27 | step_microstep: 1.89 [2025-11-06 18:13:50,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 286.40 | bwd: 1291.61 | bwd_inner: 2.17 | bwd_allreduce: 1289.32 | step: 1.98 34%|███▍ | 1199/3507 [29:04<45:59, 1.20s/it] {'loss': 0.1712, 'learning_rate': 1.5321717373640313e-05, 'epoch': 0.34} 34%|███▍ | 1199/3507 [29:04<45:59, 1.20s/it]tensor([[-2.7812, -0.2100, 1.5156, -2.4062, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.9062, -3.0938, 0.6641, 0.7891, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3750, -2.2031, 1.3125, 0.1738, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9688, -1.1328, 1.6094, 4.3750, -0.6133]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0625, -2.3594, 2.2969, 0.1387, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5156, -2.3125, 0.9336, 2.3594, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:13:50,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.43 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.7188, -4.2500, -1.1016, 2.6562, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.4688, -4.2188, 0.9727, 
0.8555, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:13:50,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 18:13:50,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.92 | bwd_microstep: 60.75 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 59.70 | step_microstep: 1.78 [2025-11-06 18:13:50,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 416.38 | bwd: 61.46 | bwd_inner: 1.59 | bwd_allreduce: 59.74 | step: 1.85 34%|███▍ | 1200/3507 [29:04<39:58, 1.04s/it] {'loss': 0.6406, 'learning_rate': 1.5313894502193457e-05, 'epoch': 0.34} 34%|███▍ | 1200/3507 [29:04<39:58, 1.04s/it]tensor([[-3.5156, -2.7500, 0.4043, 3.5156, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5938, -2.7500, 0.6875, 0.3828, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:51,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.53 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.7188, -4.3438, -0.2139, 1.5703, -3.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5000, -4.2812, -1.8516, 2.0000, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5625, -2.2344, 1.4844, -0.2041, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9219, -2.9062, -1.3359, 2.2500, -1.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6094, -3.0000, -0.1631, 2.8438, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3906, -0.0579, 3.5781, -1.3906, -3.3281]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:13:54,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:13:54,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.72 | bwd_microstep: 3187.66 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 3186.49 | step_microstep: 2.11 [2025-11-06 18:13:54,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 295.28 | bwd: 3188.56 | bwd_inner: 1.90 | bwd_allreduce: 3186.53 | step: 2.19 34%|███▍ | 1201/3507 [29:08<1:08:31, 1.78s/it] {'loss': 0.1869, 'learning_rate': 1.530606709656011e-05, 'epoch': 0.34} 34%|███▍ | 1201/3507 [29:08<1:08:31, 1.78s/it]tensor([[-4.0938, -3.1250, 0.1465, 2.6875, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:54,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.90 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.8594, -1.2656, 2.3750, 0.0918, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -2.9219, 0.0815, -0.2988, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6250, -3.9688, 0.5391, 1.8984, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7344, -2.2500, 1.0547, 1.7422, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.1406, -0.1934, 1.8438, 0.4375, -1.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.9062, -3.0781, 0.2773, -0.2227, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8125, -0.7227, 3.1250, -1.0781, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:2') [2025-11-06 18:13:55,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:13:55,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.06 | bwd_microstep: 164.30 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 163.17 | step_microstep: 1.69 [2025-11-06 18:13:55,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 401.99 | bwd: 165.32 | bwd_inner: 2.01 | bwd_allreduce: 163.20 | step: 1.77 34%|███▍ | 1202/3507 [29:08<54:54, 1.43s/it] {'loss': 0.55, 'learning_rate': 1.5298235163419162e-05, 'epoch': 0.34} 34%|███▍ | 1202/3507 [29:08<54:54, 1.43s/it]tensor([[-3.1719, -3.1875, -0.9453, 3.5781, -1.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0312, -2.4219, 0.8945, 1.0938, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:55,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.78 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[0.5352, 1.9688, 4.1250, 4.6562, 0.9414]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7344, -3.6094, -1.1328, 3.1719, -1.7422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1562, -3.9375, -0.1699, 1.8281, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-8.3125, -7.6562, -3.5781, -0.0908, -5.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.0312, 0.4922, 2.4688, -1.5547, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3438, -2.1406, 1.6875, 0.7344, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 
18:13:56,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:13:56,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.82 | bwd_microstep: 996.59 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 995.43 | step_microstep: 2.16 [2025-11-06 18:13:56,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.63 | bwd: 997.58 | bwd_inner: 1.99 | bwd_allreduce: 995.47 | step: 2.23 34%|███▍ | 1203/3507 [29:10<54:43, 1.43s/it] {'loss': 0.3021, 'learning_rate': 1.5290398709453363e-05, 'epoch': 0.34} 34%|███▍ | 1203/3507 [29:10<54:43, 1.43s/it]tensor([[-4.2500, -2.8594, 0.5234, 1.3125, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.3594, 0.9219, 1.9375, -1.3281, -1.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.3750, -2.8281, -0.0879, 3.2188, -1.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:56,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.04 | bwd_microstep: 1.25 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.3125, -3.8750, -0.9102, 2.7500, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.5000, -5.5938, -1.1562, -1.1953, -5.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5000, -1.8750, 1.3828, -1.5078, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7188, -4.1250, -0.8984, 2.7500, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -4.0625, -0.6719, 2.0625, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:13:57,088] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 18:13:57,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.63 | bwd_microstep: 263.09 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 262.04 | step_microstep: 2.03 [2025-11-06 18:13:57,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.71 | bwd: 264.34 | bwd_inner: 2.12 | bwd_allreduce: 262.08 | step: 2.12 34%|███▍ | 1204/3507 [29:10<45:34, 1.19s/it] {'loss': 0.3505, 'learning_rate': 1.5282557741349328e-05, 'epoch': 0.34} 34%|███▍ | 1204/3507 [29:10<45:34, 1.19s/it]tensor([[-5.0000, -2.9844, 1.3828, 1.3281, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1406, -3.5312, -2.0000, 2.5781, -1.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3750, -2.9219, 0.6133, 1.6953, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:57,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.84 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.7031, -0.0859, 2.1875, -1.5938, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.1562, -2.3906, 1.5703, 1.7109, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2812, -1.5312, 1.3594, 0.7852, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0938, -3.4844, 0.8438, 2.1250, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -4.1562, -0.7227, 0.8555, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:13:58,939] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.32 [2025-11-06 18:13:58,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.54 | bwd_microstep: 1230.85 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1229.81 | step_microstep: 2.49 [2025-11-06 18:13:58,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 507.41 | bwd: 1231.67 | bwd_inner: 1.69 | bwd_allreduce: 1229.86 | step: 2.57 34%|███▍ | 1205/3507 [29:12<53:12, 1.39s/it] {'loss': 0.6526, 'learning_rate': 1.5274712265797523e-05, 'epoch': 0.34} 34%|███▍ | 1205/3507 [29:12<53:12, 1.39s/it]tensor([[-2.5781, -2.7031, -1.4609, 2.2031, -1.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.4375, -3.7344, -0.6055, 2.2969, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2812, -4.2500, -1.5781, 3.2500, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9375, -3.7500, 1.0703, 0.7852, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3594, -3.5469, -2.0625, 1.6562, -1.5234]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-2.6250, -2.2656, -0.0581, 3.5312, -1.0234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:13:59,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.68 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.6562, -2.4844, 2.0469, 1.7266, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9844, -3.1562, 0.1152, 3.0625, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:14:00,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | 
optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:14:00,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.27 | bwd_microstep: 217.90 | bwd_inner_microstep: 1.37 | bwd_allreduce_microstep: 216.44 | step_microstep: 1.92 [2025-11-06 18:14:00,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 439.97 | bwd: 218.92 | bwd_inner: 2.30 | bwd_allreduce: 216.48 | step: 2.01 34%|███▍ | 1206/3507 [29:14<52:30, 1.37s/it] {'loss': 0.9809, 'learning_rate': 1.5266862289492247e-05, 'epoch': 0.34} 34%|███▍ | 1206/3507 [29:14<52:30, 1.37s/it]tensor([[-5.0312, -3.3281, 0.5781, 1.0156, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2344, -0.3809, 2.9062, -0.7109, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9375, -3.8281, -1.4297, 2.7969, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:00,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.72 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.9375, -4.7188, -1.0078, 0.7227, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7031, -2.6875, 0.5703, 2.7656, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -1.8828, 2.1562, -1.3672, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.3750, -1.9766, 0.3477, 3.9688, -0.7891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -3.2812, 0.0879, 0.7305, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:14:04,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.16 | 
optimizer_step: 0.15 [2025-11-06 18:14:04,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.61 | bwd_microstep: 3456.00 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 3454.87 | step_microstep: 2.08 [2025-11-06 18:14:04,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.34 | bwd: 3457.05 | bwd_inner: 2.00 | bwd_allreduce: 3454.92 | step: 2.16 34%|███▍ | 1207/3507 [29:17<1:21:12, 2.12s/it] {'loss': 0.2597, 'learning_rate': 1.5259007819131658e-05, 'epoch': 0.34} 34%|███▍ | 1207/3507 [29:17<1:21:12, 2.12s/it]tensor([[-3.8906, -3.6875, -1.4297, 2.1250, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -4.6250, -1.8594, 2.1875, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:04,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.76 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.3125, -4.3125, -0.7461, 1.7734, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9375, -3.8906, -1.6641, 2.2969, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.1406, -0.3125, 1.5234, 0.0703, -1.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.6875, -4.2812, -0.4453, 0.8945, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8438, -2.6094, 1.0234, -0.4961, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3594, -0.7891, 2.0938, -0.7812, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:14:04,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.88 | optimizer_gradients: 0.17 | optimizer_step: 0.18 
[2025-11-06 18:14:04,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.66 | bwd_microstep: 145.29 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 144.31 | step_microstep: 2.49 [2025-11-06 18:14:04,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 295.44 | bwd: 145.96 | bwd_inner: 1.46 | bwd_allreduce: 144.36 | step: 2.56 34%|███▍ | 1208/3507 [29:18<1:02:17, 1.63s/it] {'loss': 0.313, 'learning_rate': 1.5251148861417733e-05, 'epoch': 0.34} 34%|███▍ | 1208/3507 [29:18<1:02:17, 1.63s/it]tensor([[-5.5000, -3.5156, 0.0835, -0.8320, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8125, -3.2188, -0.1206, 3.4219, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:04,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.01 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.8750, -5.1562, -1.7188, 1.2812, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5000, -4.8438, -1.2500, 2.4062, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5000, -3.3906, 1.4844, 1.7344, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0781, 0.2988, 3.4531, 1.6875, -1.6016]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -5.0312, -3.1875, 0.9883, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -2.6094, 0.9961, -0.1016, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:14:06,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 18:14:06,690] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.90 | bwd_microstep: 1706.18 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1704.91 | step_microstep: 1.90 [2025-11-06 18:14:06,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.94 | bwd: 1707.16 | bwd_inner: 2.07 | bwd_allreduce: 1704.95 | step: 1.98 34%|███▍ | 1209/3507 [29:20<1:07:28, 1.76s/it] {'loss': 0.3188, 'learning_rate': 1.5243285423056287e-05, 'epoch': 0.34} 34%|███▍ | 1209/3507 [29:20<1:07:28, 1.76s/it]tensor([[-4.0625, -2.2500, 1.7578, 2.3750, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8750, 0.4160, 3.4219, -1.6406, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, -1.7578, 1.6797, -0.3750, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9688, -3.3750, -0.4395, 2.7500, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:06,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.85 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.5938, 0.0884, 3.7812, -2.3750, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7500, -3.0625, -0.4395, 1.8359, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5938, -3.0781, 1.5703, -0.2158, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.5000, -0.9453, 1.4844, 0.8008, -1.8516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:14:07,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:14:07,217] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 181.54 | bwd_microstep: 104.43 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 103.36 | step_microstep: 2.37 [2025-11-06 18:14:07,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.31 | bwd: 105.37 | bwd_inner: 1.80 | bwd_allreduce: 103.41 | step: 2.47 35%|███▍ | 1210/3507 [29:21<53:16, 1.39s/it] {'loss': 0.2755, 'learning_rate': 1.5235417510756954e-05, 'epoch': 0.35} 35%|███▍ | 1210/3507 [29:21<53:16, 1.39s/it]tensor([[-4.9375, -3.8906, -0.2676, 2.1719, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:14:07,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.90 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0625, -1.6641, 2.0000, 0.1494, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.0234, -1.3516, -0.1250, 4.5938, 0.5430]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.9062, -4.0000, -0.5312, 2.2812, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7031, -1.3516, 1.0625, -1.7656, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3750, -4.0312, -1.2266, 2.7031, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -4.7812, -2.9688, 0.5508, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-3.1875, -1.2656, 1.9453, 1.2031, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:14:10,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.32 | optimizer_gradients: 0.21 | optimizer_step: 0.32 [2025-11-06 18:14:10,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.88 | 
bwd_microstep: 2544.46 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 2543.23 | step_microstep: 4.31 [2025-11-06 18:14:10,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.81 | bwd: 2545.25 | bwd_inner: 1.84 | bwd_allreduce: 2543.28 | step: 4.39 35%|███▍ | 1211/3507 [29:23<1:10:49, 1.85s/it] {'loss': 1.4397, 'learning_rate': 1.522754513123319e-05, 'epoch': 0.35} 35%|███▍ | 1211/3507 [29:23<1:10:49, 1.85s/it]tensor([[-3.7969, -2.3438, 1.0625, 2.1875, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.5469, -0.5742, 2.1719, 0.8711, -1.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.7344, -0.7422, 1.7578, 3.4531, -0.6680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:10,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.15 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.7500, -2.2031, 1.5391, -0.6055, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0312, -2.1406, 0.2715, 2.0000, -1.7734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0312, -5.1562, -2.8594, 1.5938, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7812, -4.6250, -0.9336, 1.1953, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5625, 0.2070, 2.1406, -2.0938, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') [2025-11-06 18:14:10,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:14:10,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.47 | bwd_microstep: 48.54 | 
bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 47.36 | step_microstep: 2.56
[2025-11-06 18:14:10,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.64 | bwd: 49.37 | bwd_inner: 1.85 | bwd_allreduce: 47.39 | step: 2.64
35%|███▍ | 1212/3507 [29:24<54:28, 1.42s/it] {'loss': 0.5638, 'learning_rate': 1.5219668291202258e-05, 'epoch': 0.35}
tensor([[-3.8438, -1.5312, 1.7734, -0.0991, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:14:10,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.45 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-2.4062, -1.6406, 1.4141, 5.0312, -0.7539]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9062, -3.4375, 0.4824, 1.8516, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.1875, -4.7188, -0.7188, 0.7773, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.6094, -3.0000, 0.0037, 3.3906, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6562, -4.2188, -1.0234, 3.0312, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.9688, -3.5000, -0.5078, 3.1094, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0000, -2.1875, 1.3672, 1.2500, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:14:12,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.18 | optimizer_step: 0.16
[2025-11-06 18:14:12,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.52 | bwd_microstep: 1971.76 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1970.65 | step_microstep: 2.06
[2025-11-06 18:14:12,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.00 | bwd: 1972.65 | bwd_inner: 1.85 | bwd_allreduce: 1970.68 | step: 2.13
35%|███▍ | 1213/3507 [29:26<1:05:34, 1.72s/it] {'loss': 0.6449, 'learning_rate': 1.5211786997385231e-05, 'epoch': 0.35}
tensor([[-1.7969, 0.6523, 2.6719, -1.0938, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:14:13,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.68 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.7500, -1.6641, 2.7031, -0.9766, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.8438, -4.6875, -0.8984, 1.3125, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3438, -4.1250, -0.1177, 2.1562, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9062, -4.4688, -1.2812, 2.5312, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6406, -2.3750, 0.4941, 1.2734, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0625, -1.8984, 1.4375, -3.1406, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4688, -2.7500, 1.5469, -0.9883, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:14:13,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 18:14:13,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 283.59 | bwd_microstep: 1.92 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.85 | step_microstep: 1.55
[2025-11-06 18:14:13,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 469.30 | bwd: 2.90 | bwd_inner: 1.90 | bwd_allreduce: 0.88 | step: 1.63
35%|███▍ | 1214/3507 [29:27<51:45, 1.35s/it] {'loss': 0.4756, 'learning_rate': 1.5203901256506979e-05, 'epoch': 0.35}
tensor([[-4.2188, -3.9844, -1.2188, 2.9688, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:14:13,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.77 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.0000, -3.1250, 0.0693, 2.8125, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8438, -3.6250, -0.0284, 1.9062, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2500, -4.2500, -2.0156, 2.0469, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.5156, -2.7500, -1.7031, 1.6875, -0.9570]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-1.9453, -0.3008, 2.4062, 2.6250, -1.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6250, -1.2891, 3.2031, -1.3047, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.9375, -0.2100, 2.3125, -1.2656, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:14:16,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.95 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:14:16,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.51 | bwd_microstep: 2417.30 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 2416.44 | step_microstep: 4.95
[2025-11-06 18:14:16,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.30 | bwd: 2418.23 | bwd_inner: 1.63 | bwd_allreduce: 2416.48 | step: 5.03
35%|███▍ | 1215/3507 [29:30<1:08:16, 1.79s/it] {'loss': 0.7454, 'learning_rate': 1.5196011075296164e-05, 'epoch': 0.35}
tensor([[-2.9219, -2.3438, 0.3145, 3.4844, -1.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:14:16,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 93.09 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.6562, -0.2891, 3.2656, -2.0781, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.3438, -2.9688, 0.5430, 1.9531, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.1250, -4.0625, 0.8633, -2.2812, -5.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.3750, -5.5938, -1.7578, 1.7031, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.5938, -4.8750, 0.0447, 1.2812, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8125, -4.2500, -1.1562, 2.2344, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.1250, -5.9062, -2.7969, 1.3047, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:14:16,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.72 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:14:16,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.46 | bwd_microstep: 180.05 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 178.94 | step_microstep: 2.49
[2025-11-06 18:14:16,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 216.56 | bwd: 180.87 | bwd_inner: 1.75 | bwd_allreduce: 178.99 | step: 2.57
35%|███▍ | 1216/3507 [29:30<52:43, 1.38s/it] {'loss': 0.2675, 'learning_rate': 1.5188116460485245e-05, 'epoch': 0.35}
tensor([[-3.5156, -0.6445, 1.9844, -2.2031, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.2031, -2.2031, 0.4023, 1.9453, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5312, -3.6719, -0.3633, 2.4062, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:14:16,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.62 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-2.9219, -2.7188, -0.3223, 3.7656, -1.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1875, -1.4141, 2.0156, -1.2031, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.2812, -1.1484, 3.0312, -1.0938, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.2031, -0.7070, 2.3281, -0.6367, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3125, -1.5156, 2.4688, -0.4004, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:14:18,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 18:14:18,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.78 | bwd_microstep: 899.22 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 898.17 | step_microstep: 1.71
[2025-11-06 18:14:18,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.44 | bwd: 900.20 | bwd_inner: 1.84 | bwd_allreduce: 898.21 | step: 1.79
35%|███▍ | 1217/3507 [29:31<51:57, 1.36s/it] {'loss': 0.4457, 'learning_rate': 1.518021741881046e-05, 'epoch': 0.35}
tensor([[-3.3438, -3.5781, -2.1719, 1.7500, -1.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:14:18,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 104.71 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.2812, -3.7188, -0.6680, 2.7188, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4062, -2.5781, 2.1094, -0.5391, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.3125, -5.7188, -1.3516, -0.1079, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6406, -3.1562, -0.5312, 2.6406, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5312, -3.7656, 0.5156, 1.2812, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.1250, -4.0000, -1.3516, 3.0938, -1.9922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0625, -2.4844, 0.9727, 1.2500, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:14:18,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 18:14:18,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.37 | bwd_microstep: 30.76 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 29.66 | step_microstep: 1.40
[2025-11-06 18:14:18,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 310.09 | bwd: 31.74 | bwd_inner: 1.92 | bwd_allreduce: 29.70 | step: 1.48
35%|███▍ | 1218/3507 [29:32<40:37, 1.06s/it] {'loss': 0.878, 'learning_rate': 1.517231395701182e-05, 'epoch': 0.35}
tensor([[-3.9219, -2.2969, 1.3906, 2.3281, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:14:18,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.36 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-1.2266, 1.6562, 3.4062, -1.0078, -1.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.2812, -4.4688, -1.0391, 1.9609, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.9844, -1.8281, 1.5781, 3.9219, -1.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3750, -4.5312, -2.5312, 1.6328, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4062, -2.2500, 0.5312, 1.5703, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3750, -3.4219, 0.9609, 1.1328, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7812, -4.0625, -2.7344, 1.3516, -1.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:14:19,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.17 | optimizer_step: 0.28
[2025-11-06 18:14:19,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 64.24 | bwd_microstep: 1219.95 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1218.90 | step_microstep: 1.94
[2025-11-06 18:14:19,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 200.61 | bwd: 1220.88 | bwd_inner: 1.83 | bwd_allreduce: 1218.94 | step: 2.02
35%|███▍ | 1219/3507 [29:33<44:58, 1.18s/it] {'loss': 0.2405, 'learning_rate': 1.5164406081833117e-05, 'epoch': 0.35}
tensor([[-2.7500, -0.0869, 2.4531, -0.7148, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9375, -3.8750, -1.7891, 2.1562, -1.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-3.7344, -3.0781, -0.2490, 2.8750, -1.9609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.4688, -2.4688, -0.3359, 4.0625, -0.6914]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:14:20,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.00 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.2188, -2.8438, 1.6484, 0.4941, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0000, -3.1250, 0.8242, 0.9180, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.0000, -1.0625, 2.6875, 2.5469, -1.9766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8750, -1.8672, 1.4609, 0.6289, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:14:20,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:14:20,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.16 | bwd_microstep: 498.30 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 497.31 | step_microstep: 2.02
[2025-11-06 18:14:20,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.17 | bwd: 499.16 | bwd_inner: 1.70 | bwd_allreduce: 497.34 | step: 2.10
35%|███▍ | 1220/3507 [29:34<43:43, 1.15s/it] {'loss': 0.8056, 'learning_rate': 1.5156493800021896e-05, 'epoch': 0.35}
tensor([[-2.9375, -0.1416, 2.9219, -0.8242, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:14:21,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 140.94 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.3750, -3.1094, -0.5469, 3.3438, -1.4922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.0312, -5.6875, -1.2969, 0.7500, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3438, -4.3750, -1.9219, 2.6719, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9688, -4.0938, -0.6562, 2.1562, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2188, -2.7344, 0.8047, -2.0781, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.9688, 1.1641, 3.8750, -1.4922, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6562, -2.6562, 1.0547, 0.5195, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:14:23,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 18:14:23,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.43 | bwd_microstep: 2311.56 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 2310.42 | step_microstep: 2.14
[2025-11-06 18:14:23,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 319.40 | bwd: 2312.48 | bwd_inner: 1.87 | bwd_allreduce: 2310.47 | step: 2.23
35%|███▍ | 1221/3507 [29:37<1:01:03, 1.60s/it] {'loss': 0.1204, 'learning_rate': 1.514857711832948e-05, 'epoch': 0.35}
tensor([[-4.6562, -2.2812, 2.0781, 0.8828, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.4688, -6.2812, -1.5234, 1.2422, -4.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0312, -3.0312, -1.1562, 2.6562, -1.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:14:23,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.95 | bwd_microstep: 1.18 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-3.8438, -3.8438, -1.5234, 2.8594, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.8828, 0.7500, 2.6562, -1.4766, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5625, -3.9688, -0.8555, 2.7656, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.8906, -0.1670, 2.0938, -1.9609, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-2.5469, 0.3750, 2.8906, -1.0234, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:14:23,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.22 | optimizer_step: 0.14
[2025-11-06 18:14:23,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.40 | bwd_microstep: 1.71 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.61 | step_microstep: 1.95
[2025-11-06 18:14:23,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.37 | bwd: 2.89 | bwd_inner: 2.05 | bwd_allreduce: 0.67 | step: 2.07
35%|███▍ | 1222/3507 [29:37<47:00, 1.23s/it] {'loss': 0.3887, 'learning_rate': 1.5140656043510919e-05, 'epoch': 0.35}
tensor([[-5.6562, -4.8750, -0.9258, 2.6875, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:14:24,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 114.21 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-1.6562, 1.3047, 3.0781, -2.0156, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5156, -0.8984, 2.9219, 0.6523, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0000, -3.7500, -0.8125, 3.4844, -1.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.7031, -2.9688, -0.2344, 2.2500, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7500, -4.0938, 0.4004, 1.5547, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5938, -3.7500, -0.7461, 1.3359, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0625, -4.0625, -1.8281, 2.2812, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:14:27,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.20 | optimizer_step: 0.18
[2025-11-06 18:14:27,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.36 | bwd_microstep: 3377.40 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 3376.18 | step_microstep: 2.04
[2025-11-06 18:14:27,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 305.59 | bwd: 3378.37 | bwd_inner: 1.98 | bwd_allreduce: 3376.22 | step: 2.13
35%|███▍ | 1223/3507 [29:41<1:15:22, 1.98s/it] {'loss': 0.112, 'learning_rate': 1.5132730582325047e-05, 'epoch': 0.35}
tensor([[-3.9688, -1.3125, 1.9219, -1.3438, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-8.9375, -6.0938, -0.1328, -1.7656, -7.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5625, -2.4844, 1.2031, 0.5078, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9844, -3.5625, -0.7969, 2.7344, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:14:27,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.58 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.1250, -3.1875, 1.2031, 1.5078, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0156, -1.2812, 1.7188, 1.3047, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0000, -3.1562, 0.1113, 3.0000, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.2500, -2.6875, 1.7734, 0.0117, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:14:28,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:14:28,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.09 | bwd_microstep: 107.37 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 106.12 | step_microstep: 1.63
[2025-11-06 18:14:28,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.69 | bwd: 108.43 | bwd_inner: 2.14 | bwd_allreduce: 106.16 | step: 1.71
35%|███▍ | 1224/3507 [29:42<59:02, 1.55s/it] {'loss': 0.2603, 'learning_rate': 1.5124800741534407e-05, 'epoch': 0.35}
tensor([[-4.1250, -3.3906, -0.2119, 2.7656, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6562, -2.1875, 1.7891, -0.3672, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8594, -0.5234, 3.5938, -1.2188, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:14:28,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.18 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.0938, -1.1719, 2.5625, -1.1094, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.3438, -4.1562, -0.3281, -1.5312, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.9219, -2.1406, 0.5000, 2.8125, -1.4922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.5938, -0.8555, 1.4375, 0.8047, -1.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0000, -3.3125, 0.3711, 1.1328, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:14:30,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.22 | optimizer_step: 0.31
[2025-11-06 18:14:30,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 210.91 | bwd_microstep: 1630.52 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 1629.11 | step_microstep: 2.64
[2025-11-06 18:14:30,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 518.12 | bwd: 1631.59 | bwd_inner: 2.31 | bwd_allreduce: 1629.15 | step: 2.71
35%|███▍ | 1225/3507 [29:44<1:06:21, 1.74s/it] {'loss': 0.1993, 'learning_rate': 1.5116866527905303e-05, 'epoch': 0.35}
tensor([[-7.3438, -5.6250, -1.3125, -0.7656, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3750, -3.1406, 0.1973, -1.6875, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:14:30,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.38 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.3438, -3.5938, -0.2539, 2.7969, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7188, -3.4062, 0.1367, 1.3359, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4375, -3.7969, 0.1064, 0.8047, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0312, -1.2422, 2.1875, -1.1406, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.6562, -4.6875, -0.9336, 1.6016, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.3125, -3.9375, -1.1562, 2.6094, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:14:30,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:14:30,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.36 | bwd_microstep: 137.97 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 136.98 | step_microstep: 1.80
[2025-11-06 18:14:30,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.76 | bwd: 138.92 | bwd_inner: 1.76 | bwd_allreduce: 137.03 | step: 1.89
35%|███▍ | 1226/3507 [29:44<52:54, 1.39s/it] {'loss': 0.199, 'learning_rate': 1.5108927948207752e-05, 'epoch': 0.35}
tensor([[-3.4219, -3.6875, -2.0469, 2.1094, -1.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2188, -4.2188, -1.8906, 2.4375, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:14:31,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.71 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13
tensor([[-3.8438, -0.8086, 3.0625, -0.9648, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.9688, -3.2344, 0.0625, 3.2969, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-8.1250, -6.1562, -1.6719, -1.7109, -6.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.4219, -2.2812, 1.0547, 2.9219, -1.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.3438, -2.1406, 2.4844, -1.2656, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6562, -3.3125, 0.4785, 2.2812, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:14:32,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.24 | optimizer_step: 0.24
[2025-11-06 18:14:32,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.62 | bwd_microstep: 1345.43 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1344.39 | step_microstep: 2.27
[2025-11-06 18:14:32,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.35 | bwd: 1346.47 | bwd_inner: 1.80 | bwd_allreduce: 1344.47 | step: 2.41
35%|███▍ | 1227/3507 [29:46<56:30, 1.49s/it] {'loss': 0.3893, 'learning_rate': 1.5100985009215519e-05, 'epoch': 0.35}
tensor([[-4.4062, -3.7500, -0.3828, 3.0312, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.4219, -2.5938, -1.3438, 2.1562, -0.8086]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4688, -5.3750, -2.5938, 1.7344, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:14:32,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.91 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-2.3906, -2.4531, -1.5469, 1.3906, -0.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-3.5000, -0.8438, 1.6875, -1.5156, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.6250, -3.4844, -1.3203, 2.0156, -1.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.8906, -0.7461, 3.1094, -1.2188, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1562, -2.7500, 1.7344, 0.1089, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:14:33,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.06 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:14:33,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.60 | bwd_microstep: 104.14 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 102.97 | step_microstep: 2.87
[2025-11-06 18:14:33,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.52 | bwd: 105.11 | bwd_inner: 1.99 | bwd_allreduce: 103.00 | step: 2.94
35%|███▌ | 1228/3507 [29:46<44:47, 1.18s/it] {'loss': 0.6984, 'learning_rate': 1.5093037717706063e-05, 'epoch': 0.35}
tensor([[-1.8828, 0.9258, 3.6719, -0.2314, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.1641, 1.0391, 2.1250, -0.9531, -1.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-4.4062, -2.5938, 0.8594, 0.8867, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:14:33,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.70 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.8438, -3.1250, 0.2617, 0.2305, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.9531, -2.1719, 0.6211, 3.0312, -1.4922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2812, -1.0312, 3.4531, -0.6562, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.4062, -3.7656, 0.3926, -2.1719, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2812, -2.0000, 1.8750, 0.6055, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:14:34,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 18:14:34,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.35 | bwd_microstep: 1392.75 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1391.72 | step_microstep: 1.57
[2025-11-06 18:14:34,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.08 | bwd: 1393.53 | bwd_inner: 1.66 | bwd_allreduce: 1391.75 | step: 1.65
35%|███▌ | 1229/3507 [29:48<51:48, 1.36s/it] {'loss': 0.436, 'learning_rate': 1.5085086080460573e-05, 'epoch': 0.35}
tensor([[-4.5625, -1.8047, 2.7344, 0.3770, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-8.1250, -6.6562, -1.4844, 0.8867, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0625, -1.2578, 1.8516, -1.7500, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5312, -4.2812, -1.7188, 1.8828, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0469, -3.0469, -0.8477, 3.3281, -1.1797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6562, -2.8125, 0.8203, 0.5117, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5000, -0.1660, 3.5938, -1.3984, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:14:35,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.28 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.7500, -1.9922, 0.3750, 2.0625, -1.5078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:14:35,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:14:35,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.72 | bwd_microstep: 1.83 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.85
[2025-11-06 18:14:35,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.01 | bwd: 2.78 | bwd_inner: 1.79 | bwd_allreduce: 0.88 | step: 1.93
35%|███▌ | 1230/3507 [29:49<45:39, 1.20s/it] {'loss': 0.356, 'learning_rate': 1.5077130104263944e-05, 'epoch': 0.35}
tensor([[-1.3203, 1.1484, 2.6719, -1.3516, -1.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5000, -2.3594, 2.4531, -1.1094, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:14:35,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.22 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.2812, -1.9922, 2.2500, 1.1172, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4688, -3.6719, -0.5273, 1.9219, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0312, -4.5625, -1.3750, 2.3125, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.3594, -2.6562, -1.6016, 2.1875, -0.7461]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.1875, -1.6406, 1.9375, -0.2891, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.3750, -2.6719, -1.6719, 1.7891, -0.7773]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:14:38,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.32 | optimizer_gradients: 0.21 | optimizer_step: 0.17
[2025-11-06 18:14:38,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.35 | bwd_microstep: 2184.09 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 2183.04 | step_microstep: 4.03
[2025-11-06 18:14:38,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.60 | bwd: 2184.97 | bwd_inner: 1.75 | bwd_allreduce: 2183.08 | step: 4.12
35%|███▌ | 1231/3507 [29:52<1:00:56, 1.61s/it] {'loss': 0.131, 'learning_rate': 1.506916979590477e-05, 'epoch': 0.35}
tensor([[-2.1250, -0.3848, 2.0625, 1.4531, -1.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4844, -2.9531, -0.2930, 2.5625, -1.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5781, -1.6953, 1.5234, 0.8516, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2812, -3.2656, 0.0723, 2.1406, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.8594, 0.6133, 3.9531, -1.5391, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:14:38,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.64 | bwd_microstep: 1.27 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14
tensor([[-5.6562, -3.2500, 1.3672, 0.1113, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2188, -4.0000, -0.8555, 3.8125, -1.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.1562, -2.0469, 2.5000, -1.1016, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:14:38,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 18:14:38,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 120.32 | bwd_microstep: 110.51 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 109.45 | step_microstep: 1.61
[2025-11-06 18:14:38,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.99 | bwd: 111.77 | bwd_inner: 2.02 | bwd_allreduce: 109.52 | step: 1.76
35%|███▌ | 1232/3507 [29:52<48:49, 1.29s/it] {'loss': 0.9507, 'learning_rate': 1.5061205162175343e-05, 'epoch': 0.35}
tensor([[-4.9688, -2.8125, 0.9766, -0.3281, -3.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.1875, -5.3125, -0.5664, -0.0879, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2188, -1.9453, 1.7578, 0.6250, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:14:39,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.31 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.1562, -1.9531, 1.6406, 0.2656, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6875, -1.8438, 2.4844, -0.1221, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.7812, -2.8438, 0.5625, -0.2832, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.8125, -2.7031, 0.8906, -0.1553, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4062, -1.8047, 2.0781, -0.3965, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:14:40,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.22 | optimizer_step: 0.34
[2025-11-06 18:14:40,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.06 | bwd_microstep: 1503.39 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1502.20 | step_microstep: 2.42
[2025-11-06 18:14:40,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.39 | bwd: 1504.44 | bwd_inner: 2.05 | bwd_allreduce: 1502.26 | step: 2.51
35%|███▌ | 1233/3507 [29:54<55:50, 1.47s/it] {'loss': 0.5508, 'learning_rate': 1.5053236209871647e-05, 'epoch': 0.35}
tensor([[-3.5938, -3.5938, -1.5781, 2.1250, -1.7422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1562, -1.4609, 1.8750, -0.9961, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1875, -2.5625, 0.9180, 1.6016, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:41,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.62 | bwd_microstep: 1.11 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-1.9766, 1.0234, 3.9062, -0.4668, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.3906, 1.4141, 2.8906, -1.5781, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.0625, -2.0938, 0.6562, 2.3750, -1.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8438, -3.5156, 0.2080, -1.7109, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-9.8125, -7.5938, -2.8750, -4.0312, -7.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:14:41,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:14:41,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.68 | bwd_microstep: 43.86 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 42.62 | step_microstep: 1.42 [2025-11-06 18:14:41,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.34 | bwd: 44.98 | bwd_inner: 2.17 | bwd_allreduce: 42.67 | step: 1.52 35%|███▌ | 1234/3507 [29:55<44:27, 1.17s/it] {'loss': 0.759, 'learning_rate': 1.5045262945793342e-05, 'epoch': 0.35} 35%|███▌ | 1234/3507 [29:55<44:27, 1.17s/it]tensor([[-4.9375, -2.2656, 2.0469, -0.0383, -3.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.6719, -1.4688, 0.8945, 5.0625, -0.0332]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:41,478] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.05 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.3594, -0.0189, 3.5938, -1.3984, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4844, -3.7969, -2.4375, 1.5625, -1.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5625, -1.5078, 1.5234, 3.8594, -1.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.5312, 0.2520, 3.2344, -0.5508, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6562, -0.8594, 2.4844, -0.7109, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.0000, 1.2578, 4.1250, 2.5625, -0.6797]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:14:42,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.24 | optimizer_step: 0.21 [2025-11-06 18:14:42,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.61 | bwd_microstep: 703.41 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 702.42 | step_microstep: 2.30 [2025-11-06 18:14:42,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.69 | bwd: 704.35 | bwd_inner: 1.71 | bwd_allreduce: 702.47 | step: 2.39 35%|███▌ | 1235/3507 [29:56<44:01, 1.16s/it] {'loss': 0.0959, 'learning_rate': 1.5037285376743787e-05, 'epoch': 0.35} 35%|███▌ | 1235/3507 [29:56<44:01, 1.16s/it]tensor([[-3.6250, -0.8242, 2.6406, -1.1016, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6562, -3.2344, -0.2852, 3.4688, -1.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8125, -1.0938, 3.0938, 0.8203, -3.0781]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.1719, 0.8281, 3.5000, -1.2031, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9688, -3.4844, 0.4160, 1.7578, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0938, -5.2500, -1.6016, 1.1797, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5469, -1.1016, 2.0156, -0.4551, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:14:43,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.29 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.3125, -4.0625, -0.0493, 2.1406, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:43,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:14:43,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.39 | bwd_microstep: 2.15 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.68 [2025-11-06 18:14:43,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 434.60 | bwd: 3.20 | bwd_inner: 2.20 | bwd_allreduce: 0.87 | step: 1.75 35%|███▌ | 1236/3507 [29:57<43:23, 1.15s/it] {'loss': 0.1046, 'learning_rate': 1.5029303509529991e-05, 'epoch': 0.35} 35%|███▌ | 1236/3507 [29:57<43:23, 1.15s/it]tensor([[-4.6875, -3.6094, -0.1543, 1.9844, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.6250, -6.5625, -2.3438, 0.1035, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5000, -4.3125, -0.5977, 1.4531, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:3') tensor([[-4.5938, -3.0000, 0.2969, 0.8711, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.3438, -5.1250, -0.7852, 1.7031, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -3.2656, 1.0625, -0.0449, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:44,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.58 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.5000, -3.8125, -0.4922, 2.7656, -2.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5625, -3.6406, -0.0364, 2.6875, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:14:44,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:14:44,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.19 | bwd_microstep: 89.97 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 88.92 | step_microstep: 1.70 [2025-11-06 18:14:44,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 388.79 | bwd: 90.80 | bwd_inner: 1.68 | bwd_allreduce: 88.97 | step: 1.79 35%|███▌ | 1237/3507 [29:58<41:52, 1.11s/it] {'loss': 0.3742, 'learning_rate': 1.5021317350962648e-05, 'epoch': 0.35} 35%|███▌ | 1237/3507 [29:58<41:52, 1.11s/it]tensor([[-4.4062, -4.5625, -2.1406, 2.6719, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.7500, -5.5938, -1.0859, 1.7891, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.1250, 0.6172, 3.1250, -0.4629, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.9062, -2.0000, 
2.2812, -0.8633, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2500, -2.2031, 1.2344, 0.2676, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5000, -4.4062, -0.7695, 1.3125, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3281, -1.2344, 1.6250, 3.2969, -1.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:14:46,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.84 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.3281, 0.6602, 3.1406, -1.5703, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:14:47,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:14:47,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.64 | bwd_microstep: 1.94 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.92 | step_microstep: 2.60 [2025-11-06 18:14:47,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.51 | bwd: 2.66 | bwd_inner: 1.57 | bwd_allreduce: 0.95 | step: 2.68 35%|███▌ | 1238/3507 [30:00<58:10, 1.54s/it] {'loss': 0.6506, 'learning_rate': 1.5013326907856105e-05, 'epoch': 0.35} 35%|███▌ | 1238/3507 [30:00<58:10, 1.54s/it]tensor([[-6.9688, -4.0938, 0.0508, -3.0000, -5.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7812, -1.2969, 1.8516, 2.0469, -1.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:14:47,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.55 | bwd_microstep: 1.54 | bwd_inner_microstep: 1.45 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 
tensor([[-5.2812, -3.5156, 0.5195, 1.1094, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4375, -1.6328, 2.4375, -0.5234, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6250, -2.3438, 1.8672, -2.5938, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2188, -2.0625, 0.9727, 2.7656, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5625, -3.9375, -0.9531, 1.9766, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2812, -1.4766, 2.5625, -0.6250, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:14:47,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:14:47,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.41 | bwd_microstep: 577.81 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 576.94 | step_microstep: 2.09 [2025-11-06 18:14:47,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.98 | bwd: 579.35 | bwd_inner: 2.23 | bwd_allreduce: 576.98 | step: 2.17 35%|███▌ | 1239/3507 [30:01<51:12, 1.35s/it] {'loss': 0.2179, 'learning_rate': 1.5005332187028367e-05, 'epoch': 0.35} 35%|███▌ | 1239/3507 [30:01<51:12, 1.35s/it]tensor([[-5.1250, -3.0312, 0.5977, 0.0640, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3438, -4.0938, -1.0859, 3.2812, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4219, -3.7500, -2.2812, 1.9375, -1.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1094, 0.3906, 3.4219, -2.3906, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], 
device='cuda:1') tensor([[-6.1875, -3.2031, 1.7266, -1.1016, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4375, -2.8906, -0.2734, 2.6875, -1.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -3.1875, -0.2852, 1.5547, -2.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:50,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.07 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.1562, -3.4062, -0.3613, 2.3438, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:50,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:14:50,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.00 | bwd_microstep: 1.64 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.44 [2025-11-06 18:14:50,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.08 | bwd: 2.40 | bwd_inner: 1.41 | bwd_allreduce: 0.84 | step: 2.53 35%|███▌ | 1240/3507 [30:04<1:02:31, 1.65s/it] {'loss': 0.5024, 'learning_rate': 1.4997333195301088e-05, 'epoch': 0.35} 35%|███▌ | 1240/3507 [30:04<1:02:31, 1.65s/it]tensor([[-4.5625, -3.9531, -1.0234, 2.2500, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4688, -3.5000, -0.2061, 2.1250, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.3750, -2.8594, 0.7422, 1.5469, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1875, -0.5781, 1.9688, -1.0859, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 18:14:50,554] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.90 | bwd_microstep: 5.58 | bwd_inner_microstep: 5.45 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.1250, -3.7656, -1.1406, 2.0000, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -3.9844, -0.1377, 2.6562, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4688, -2.5938, 1.0391, 0.6250, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7188, -2.7188, -1.3828, 1.8125, -1.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:14:50,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:14:50,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.72 | bwd_microstep: 131.09 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 130.11 | step_microstep: 1.57 [2025-11-06 18:14:50,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.64 | bwd: 136.68 | bwd_inner: 6.37 | bwd_allreduce: 130.16 | step: 1.66 35%|███▌ | 1241/3507 [30:04<50:10, 1.33s/it] {'loss': 0.9421, 'learning_rate': 1.498932993949957e-05, 'epoch': 0.35} 35%|███▌ | 1241/3507 [30:04<50:10, 1.33s/it]tensor([[-4.0625, -3.2031, -0.0135, 2.5469, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0000, -4.4375, -0.2451, 0.6406, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.6094, -1.8828, 0.2158, 1.9766, -1.3516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5000, -4.4375, -1.9062, 2.3750, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -2.2812, 1.7734, 0.0253, -3.7969]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, -4.6562, -1.5000, 2.0000, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2812, -3.6562, -0.6719, 2.2812, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:52,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.37 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2188, -3.2969, 0.2305, 3.0156, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:53,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.22 | optimizer_step: 0.20 [2025-11-06 18:14:53,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.16 | bwd_microstep: 1.97 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.87 [2025-11-06 18:14:53,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.53 | bwd: 3.01 | bwd_inner: 2.00 | bwd_allreduce: 0.85 | step: 2.95 35%|███▌ | 1242/3507 [30:06<1:00:20, 1.60s/it] {'loss': 0.2393, 'learning_rate': 1.4981322426452747e-05, 'epoch': 0.35} 35%|███▌ | 1242/3507 [30:06<1:00:20, 1.60s/it]tensor([[-4.5312, -4.3438, -1.3672, 2.9844, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7188, 0.1611, 2.8750, -1.0391, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3281, -0.4297, 2.0469, -2.2344, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:14:53,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.81 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.2188, 
-0.9453, 2.0156, -0.0315, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0625, -3.2031, 0.0811, 2.7188, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6719, -1.7266, 1.5703, 0.6406, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.1963, 1.8828, 4.6250, 4.4688, 0.5820]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0000, -3.9688, -0.0962, 2.7500, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:14:54,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:14:54,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.57 | bwd_microstep: 148.21 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 147.28 | step_microstep: 1.55 [2025-11-06 18:14:54,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 458.41 | bwd: 149.01 | bwd_inner: 1.55 | bwd_allreduce: 147.32 | step: 1.62 35%|███▌ | 1243/3507 [30:08<54:30, 1.44s/it] {'loss': 0.2156, 'learning_rate': 1.4973310662993195e-05, 'epoch': 0.35} 35%|███▌ | 1243/3507 [30:08<54:30, 1.44s/it]tensor([[-5.1562, -3.2344, 0.6914, 0.6094, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8438, -5.0000, -2.7656, 1.5234, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3281, -0.3613, 3.8750, 0.7695, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5000, -3.9062, -0.8203, 2.3438, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3438, -3.1250, 1.2578, 0.8008, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-1.4766, 0.9922, 2.5625, -1.0938, -1.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -3.4531, -0.1846, 1.5781, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:55,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.43 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-2.1406, 1.0703, 4.0312, -1.1094, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:14:56,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:14:56,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.85 | bwd_microstep: 1.93 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.31 [2025-11-06 18:14:56,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 450.31 | bwd: 2.77 | bwd_inner: 1.77 | bwd_allreduce: 0.86 | step: 2.39 35%|███▌ | 1244/3507 [30:10<1:00:52, 1.61s/it] {'loss': 0.2808, 'learning_rate': 1.4965294655957103e-05, 'epoch': 0.35} 35%|███▌ | 1244/3507 [30:10<1:00:52, 1.61s/it]tensor([[-5.5938, -4.3125, -0.5273, 1.4844, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3750, -1.4062, 2.7500, -0.5195, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3125, -2.6406, 2.0156, -0.2480, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:14:56,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 337.38 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.2812, -4.1250, -1.3047, 3.0312, -2.1250]], device='cuda:2', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7500, -1.9297, 1.9922, -0.8398, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8438, -4.0000, 0.3066, 0.4531, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5312, -2.3906, 1.5234, 0.9375, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6562, -3.2969, -1.0781, 1.5938, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:14:57,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:14:57,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.90 | bwd_microstep: 256.46 | bwd_inner_microstep: 13.67 | bwd_allreduce_microstep: 242.70 | step_microstep: 1.84 [2025-11-06 18:14:57,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 529.32 | bwd: 257.30 | bwd_inner: 14.42 | bwd_allreduce: 242.73 | step: 1.92 36%|███▌ | 1245/3507 [30:11<55:15, 1.47s/it] {'loss': 0.291, 'learning_rate': 1.4957274412184295e-05, 'epoch': 0.36} 36%|███▌ | 1245/3507 [30:11<55:15, 1.47s/it]tensor([[-3.5312, -0.4004, 2.9688, -1.3828, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-8.9375, -7.7500, -3.2969, -0.8711, -6.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -3.2500, -0.2100, 1.7031, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7031, 0.0669, 3.6094, 0.5820, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9688, -3.6094, -0.6484, 3.2812, -1.9609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1562, -2.9062, 0.0991, 1.1328, -2.7344]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2500, -3.6562, 0.1660, 0.8555, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:14:58,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.42 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0938, -3.9062, -1.3516, 2.3281, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:58,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.68 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:14:58,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.40 | bwd_microstep: 1.90 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.52 [2025-11-06 18:14:58,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.86 | bwd: 2.84 | bwd_inner: 1.76 | bwd_allreduce: 0.92 | step: 2.60 36%|███▌ | 1246/3507 [30:12<55:53, 1.48s/it] {'loss': 0.3648, 'learning_rate': 1.4949249938518203e-05, 'epoch': 0.36} 36%|███▌ | 1246/3507 [30:12<55:53, 1.48s/it]tensor([[-5.1875, -3.5469, 0.2109, 0.7812, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5000, -3.6562, -0.2715, 2.7031, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0312, -3.6250, 0.1118, 1.4688, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:14:59,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.23 | bwd_microstep: 3.03 | bwd_inner_microstep: 2.89 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-2.4219, -0.2490, 2.5000, 0.7539, -1.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8750, -3.7031, -1.5156, 1.8750, 
-2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6250, -2.7188, 0.9297, 0.4395, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8594, -2.0469, 0.9844, -0.0153, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1562e+00, -3.2656e+00, -4.3488e-04, 2.5938e+00, -2.3750e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:15:03,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.23 | optimizer_step: 0.33 [2025-11-06 18:15:03,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.00 | bwd_microstep: 1607.85 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1606.76 | step_microstep: 141.13 [2025-11-06 18:15:03,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.25 | bwd: 1610.88 | bwd_inner: 3.87 | bwd_allreduce: 1606.83 | step: 141.23 36%|███▌ | 1247/3507 [30:17<1:32:02, 2.44s/it] {'loss': 0.3114, 'learning_rate': 1.4941221241805868e-05, 'epoch': 0.36} 36%|███▌ | 1247/3507 [30:17<1:32:02, 2.44s/it]tensor([[-3.6719, -2.0625, 1.2734, 1.9062, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8750, -2.3281, 1.5547, 2.8750, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4531, -0.9258, 2.4531, 0.1523, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7188, -3.6094, 0.0723, 2.3594, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:15:03,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.28 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.6250, -3.9375, -0.6797, 2.3438, -2.6719]], 
[... interleaved per-rank logits/label debug prints (bfloat16 tensors, truncated grad_fn reprs) and [Rank 0] per-step DeepSpeed timing lines elided ...]
 36%|███▌ | 1248/3507 [30:17<1:10:43, 1.88s/it] {'loss': 0.4169, 'learning_rate': 1.4933188328897933e-05, 'epoch': 0.36}
[h264 @ 0xd41e580] mmco: unref short failure
[h264 @ 0xd41e580] mmco: unref short failure
 36%|███▌ | 1249/3507 [30:19<1:11:36, 1.90s/it] {'loss': 0.7632, 'learning_rate': 1.492515120664865e-05, 'epoch': 0.36}
 36%|███▌ | 1250/3507 [30:20<55:27, 1.47s/it] {'loss': 0.3248, 'learning_rate': 1.4917109881915844e-05, 'epoch': 0.36}
 36%|███▌ | 1251/3507 [30:23<1:18:50, 2.10s/it] {'loss': 0.4931, 'learning_rate': 1.490906436156094e-05, 'epoch': 0.36}
 36%|███▌ | 1252/3507 [30:24<1:00:36, 1.61s/it] {'loss': 0.6283, 'learning_rate': 1.4901014652448939e-05, 'epoch': 0.36}
 36%|███▌ | 1253/3507 [30:27<1:15:52, 2.02s/it] {'loss': 1.2161, 'learning_rate': 1.4892960761448417e-05, 'epoch': 0.36}
 36%|███▌ | 1254/3507 [30:27<58:17, 1.55s/it] {'loss': 0.5371, 'learning_rate': 1.4884902695431516e-05, 'epoch': 0.36}
 36%|███▌ | 1255/3507 [30:30<1:12:45, 1.94s/it] {'loss': 0.4321, 'learning_rate': 1.4876840461273939e-05, 'epoch': 0.36}
 36%|███▌ | 1256/3507 [30:31<55:19, 1.47s/it] {'loss': 0.2438, 'learning_rate': 1.486877406585495e-05, 'epoch': 0.36}
 36%|███▌ | 1257/3507 [30:33<1:03:06, 1.68s/it] {'loss': 0.9911, 'learning_rate': 1.4860703516057364e-05, 'epoch': 0.36}
 36%|███▌ | 1258/3507 [30:33<48:46, 1.30s/it] {'loss': 0.749, 'learning_rate': 1.4852628818767536e-05, 'epoch': 0.36}
 36%|███▌ | 1259/3507 [30:36<1:06:59, 1.79s/it] {'loss': 0.8523, 'learning_rate': 1.4844549980875363e-05, 'epoch': 0.36}
 36%|███▌ | 1260/3507 [30:37<52:16, 1.40s/it] {'loss': 0.1898, 'learning_rate': 1.4836467009274276e-05, 'epoch': 0.36}
 36%|███▌ | 1261/3507 [30:38<52:55, 1.41s/it] {'loss': 0.5134, 'learning_rate': 1.482837991086123e-05, 'epoch': 0.36}
 36%|███▌ | 1262/3507 [30:39<49:33, 1.32s/it] {'loss': 0.1109, 'learning_rate': 1.4820288692536702e-05, 'epoch': 0.36}
 36%|███▌ | 1263/3507 [30:42<1:03:34, 1.70s/it] {'loss': 0.1968, 'learning_rate': 1.4812193361204689e-05, 'epoch': 0.36}
 36%|███▌ | 1264/3507 [30:42<50:38, 1.35s/it] {'loss': 0.7531, 'learning_rate': 1.4804093923772691e-05, 'epoch': 0.36}
 36%|███▌ | 1265/3507 [30:43<40:43, 1.09s/it] {'loss': 0.3377, 'learning_rate': 1.4795990387151719e-05, 'epoch': 0.36}
 36%|███▌ | 1266/3507 [30:45<53:21, 1.43s/it] {'loss': 0.1149, 'learning_rate': 1.4787882758256271e-05, 'epoch': 0.36}
 36%|███▌ | 1267/3507 [30:45<42:20, 1.13s/it] {'loss': 0.4966, 'learning_rate': 1.4779771044004347e-05, 'epoch': 0.36}
[2025-11-06 18:15:34,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 |
optimizer_gradients: 0.16 | optimizer_step: 0.24 [2025-11-06 18:15:34,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.82 | bwd_microstep: 1345.40 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1344.12 | step_microstep: 2.09 [2025-11-06 18:15:34,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.17 | bwd: 1346.28 | bwd_inner: 1.92 | bwd_allreduce: 1344.18 | step: 2.19 36%|███▌ | 1268/3507 [30:47<52:52, 1.42s/it] {'loss': 0.1088, 'learning_rate': 1.4771655251317426e-05, 'epoch': 0.36} 36%|███▌ | 1268/3507 [30:47<52:52, 1.42s/it]tensor([[-5.5938, -3.0781, 1.7031, 0.4531, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2969, -0.8711, 2.2812, -0.2129, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:15:34,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.74 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-4.5312, -1.3438, 2.6562, -1.3359, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8438, -3.2500, 0.2832, 0.7930, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8750, -1.0703, 2.7344, -0.0306, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7656, -0.5195, 2.5312, -2.0938, -3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4062, 0.3027, 3.3750, 0.4160, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2812, -0.3789, 2.3281, -1.3516, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:15:34,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.21 | 
optimizer_step: 0.20 [2025-11-06 18:15:34,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.77 | bwd_microstep: 1.92 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.87 | step_microstep: 1.84 [2025-11-06 18:15:34,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.56 | bwd: 2.82 | bwd_inner: 1.70 | bwd_allreduce: 0.93 | step: 1.95 36%|███▌ | 1269/3507 [30:48<41:42, 1.12s/it] {'loss': 0.1574, 'learning_rate': 1.4763535387120475e-05, 'epoch': 0.36} 36%|███▌ | 1269/3507 [30:48<41:42, 1.12s/it]tensor([[-5.1562, -3.3594, 0.7383, 1.1016, -3.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -3.9531, -0.7695, 1.7266, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8438, -2.7188, 0.5391, 2.7188, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:15:34,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 200.78 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.22 tensor([[-5.0938, -3.4375, 0.2451, 0.7305, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1875, -1.6328, 1.0078, -1.9297, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.9375, -1.0312, 2.9219, -0.2617, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0312, -3.6406, -0.7383, 2.9062, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -1.6953, 1.6406, -2.2656, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:15:37,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:15:37,825] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.04 | bwd_microstep: 2813.73 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 2812.78 | step_microstep: 2.09 [2025-11-06 18:15:37,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.83 | bwd: 2814.73 | bwd_inner: 1.74 | bwd_allreduce: 2812.83 | step: 2.30 36%|███▌ | 1270/3507 [30:51<1:05:38, 1.76s/it] {'loss': 0.5213, 'learning_rate': 1.4755411458341924e-05, 'epoch': 0.36} 36%|███▌ | 1270/3507 [30:51<1:05:38, 1.76s/it]tensor([[-5.4375, -4.7188, -1.2891, 1.5000, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8438, -4.6250, -0.7383, 1.3203, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:15:38,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.51 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.3438, -2.5938, 1.1094, 1.3906, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -4.0000, -2.0469, 1.4844, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6250, -3.8594, 0.8555, 1.8906, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1562, -0.0649, 3.5938, -0.2676, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1562, -2.5000, 2.1875, 0.0933, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6875, -3.0000, 1.1094, 2.0625, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:15:38,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:15:38,373] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 153.64 | bwd_microstep: 186.31 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 185.19 | step_microstep: 1.92 [2025-11-06 18:15:38,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.18 | bwd: 187.27 | bwd_inner: 1.91 | bwd_allreduce: 185.23 | step: 2.00 36%|███▌ | 1271/3507 [30:52<52:03, 1.40s/it] {'loss': 0.3268, 'learning_rate': 1.4747283471913685e-05, 'epoch': 0.36} 36%|███▌ | 1271/3507 [30:52<52:03, 1.40s/it]tensor([[-4.9375, -4.0938, -1.1328, 1.0312, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -1.1641, 2.9375, -2.5938, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4062, -2.3438, 1.4062, 0.9336, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:15:38,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.68 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.1562, -4.6562, -1.5156, 1.9844, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-8.4375, -7.8750, -4.0312, -0.2832, -5.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7812, -1.6562, 2.5000, -1.5312, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0625, -3.3281, 1.3359, -0.9023, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.5625, -2.6250, -1.1094, 2.5312, -0.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') [2025-11-06 18:15:40,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:15:40,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
163.87 | bwd_microstep: 1843.76 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1842.75 | step_microstep: 1.63 [2025-11-06 18:15:40,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.58 | bwd: 1844.56 | bwd_inner: 1.64 | bwd_allreduce: 1842.79 | step: 1.71 36%|███▋ | 1272/3507 [30:54<1:01:20, 1.65s/it] {'loss': 0.5447, 'learning_rate': 1.4739151434771114e-05, 'epoch': 0.36} 36%|███▋ | 1272/3507 [30:54<1:01:20, 1.65s/it]tensor([[-4.9688, -2.4219, 1.9609, 0.4238, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4688, -3.5000, 0.6172, -2.5156, -5.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2188, -2.1250, 1.9141, 1.3594, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:15:40,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.97 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.4375, -3.2812, 0.1602, 1.9766, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3125, -3.5312, 0.6484, 1.2422, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.2812, -4.0000, 1.3516, 1.5078, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0000, -2.9531, 0.9727, 0.4766, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5000, -4.6250, -2.4688, 1.8203, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:15:41,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:15:41,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.40 | bwd_microstep: 11.72 | 
bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 10.56 | step_microstep: 1.51 [2025-11-06 18:15:41,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.39 | bwd: 12.52 | bwd_inner: 1.81 | bwd_allreduce: 10.59 | step: 1.58 36%|███▋ | 1273/3507 [30:54<47:34, 1.28s/it] {'loss': 0.3751, 'learning_rate': 1.4731015353853046e-05, 'epoch': 0.36} 36%|███▋ | 1273/3507 [30:54<47:34, 1.28s/it]tensor([[-4.5938, -3.2656, 0.3125, 1.6016, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:15:41,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.00 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.7812, -5.1250, -3.1406, 1.6797, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -4.5000, -1.4609, 1.5391, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -4.4375, -1.2188, 2.1406, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.1562, -0.2617, 1.2344, -0.8555, -1.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4531, -0.3320, 3.7969, 0.2148, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3281, -3.2656, -0.8672, 3.2812, -1.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0938, 0.6133, 2.4531, -1.4062, -2.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 18:15:43,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.27 | optimizer_step: 0.36 [2025-11-06 18:15:43,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.52 | bwd_microstep: 2448.03 | bwd_inner_microstep: 1.11 | 
bwd_allreduce_microstep: 2446.80 | step_microstep: 3.01 [2025-11-06 18:15:43,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.55 | bwd: 2448.99 | bwd_inner: 1.96 | bwd_allreduce: 2446.86 | step: 3.10 36%|███▋ | 1274/3507 [30:57<1:04:55, 1.74s/it] {'loss': 0.5148, 'learning_rate': 1.4722875236101746e-05, 'epoch': 0.36} 36%|███▋ | 1274/3507 [30:57<1:04:55, 1.74s/it]tensor([[-6.4688, -5.7500, -2.0156, 1.4609, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1719, -1.0312, 2.1406, 0.6992, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:15:44,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.52 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.3750, -3.7031, -0.5547, 2.3750, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -3.0781, 0.7773, 1.1094, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7500, -5.2500, -1.9531, 1.4453, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5625, -2.5781, 1.1719, 0.9766, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9844, -3.4531, -0.4336, 2.8438, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.3750, -5.0000, -0.7812, 1.0938, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:15:44,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:15:44,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.29 | bwd_microstep: 129.37 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 128.21 | 
step_microstep: 2.04 [2025-11-06 18:15:44,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.82 | bwd: 130.47 | bwd_inner: 2.09 | bwd_allreduce: 128.25 | step: 2.13 36%|███▋ | 1275/3507 [30:58<51:28, 1.38s/it] {'loss': 0.2632, 'learning_rate': 1.4714731088462935e-05, 'epoch': 0.36} 36%|███▋ | 1275/3507 [30:58<51:28, 1.38s/it]tensor([[-3.5000, -0.8828, 2.5625, -0.0815, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1094, 0.0630, 3.0469, -1.4375, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:15:44,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.47 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.0000, -3.2500, -0.6484, 1.2734, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3438, -1.1953, 2.9844, -0.8906, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9375, -1.6172, 2.2188, 0.9375, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.0625, -0.5039, 2.2031, 2.4375, -1.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -2.8594, 1.2109, 0.8828, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0938, -1.4844, 2.1094, -0.9102, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:15:46,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.19 | optimizer_step: 0.23 [2025-11-06 18:15:46,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.67 | bwd_microstep: 1800.44 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1799.37 | step_microstep: 2.14 [2025-11-06 
18:15:46,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.15 | bwd: 1801.35 | bwd_inner: 1.77 | bwd_allreduce: 1799.42 | step: 2.23 36%|███▋ | 1276/3507 [31:00<1:00:42, 1.63s/it] {'loss': 0.9428, 'learning_rate': 1.4706582917885767e-05, 'epoch': 0.36} 36%|███▋ | 1276/3507 [31:00<1:00:42, 1.63s/it]tensor([[-3.2656, -1.7500, 1.4375, 1.4219, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3125, -3.3438, -1.7031, 1.8672, -1.4922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3281, -0.8750, 2.1719, -0.8320, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8750, -4.3750, -0.1089, 1.2891, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:15:46,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.61 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-3.9375, -3.3281, -0.1807, 3.1250, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5312, -4.0625, -0.6172, 3.4688, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8750, -3.0938, 0.9492, 1.0234, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -1.3438, 2.0000, -1.2500, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:15:47,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:15:47,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.73 | bwd_microstep: 763.89 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 762.66 | step_microstep: 1.51 [2025-11-06 18:15:47,832] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 419.37 | bwd: 764.83 | bwd_inner: 1.97 | bwd_allreduce: 762.70 | step: 1.62 36%|███▋ | 1277/3507 [31:01<56:06, 1.51s/it] {'loss': 0.2515, 'learning_rate': 1.4698430731322834e-05, 'epoch': 0.36} 36%|███▋ | 1277/3507 [31:01<56:06, 1.51s/it]tensor([[-4.0312, -0.8242, 3.1094, -0.9375, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8750, -0.4492, 3.1562, -1.9453, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-4.0938, -3.9531, -1.4219, 2.5938, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([3], device='cuda:3') [2025-11-06 18:15:48,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.83 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[1.9531, 3.0938, 4.9375, 5.8750, 2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5312, -3.7500, -0.3848, 2.6719, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5625, -3.0469, 1.4141, -0.1128, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -3.5312, 0.1572, 1.9062, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3125, -3.4375, 0.6328, 0.5938, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:15:48,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:15:48,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.64 | bwd_microstep: 654.76 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 653.61 | step_microstep: 1.58 [2025-11-06 18:15:48,838] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 318.52 | bwd: 655.61 | bwd_inner: 1.82 | bwd_allreduce: 653.64 | step: 1.65 36%|███▋ | 1278/3507 [31:02<50:27, 1.36s/it] {'loss': 0.2084, 'learning_rate': 1.469027453573015e-05, 'epoch': 0.36} 36%|███▋ | 1278/3507 [31:02<50:27, 1.36s/it]tensor([[-3.1562, -3.2500, -1.3750, 2.5938, -1.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:15:48,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.73 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.3125, -2.5000, 1.3594, 1.7969, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -4.4375, -1.6250, 1.9297, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8750, -2.6406, 1.7500, -2.0781, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -3.5000, -0.2324, 2.8906, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2500, -1.6875, 2.6094, 0.5078, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.4688, -5.8125, -0.6484, 1.0000, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7500, -1.2656, 2.8281, -2.2031, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:15:49,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.26 | optimizer_step: 0.21 [2025-11-06 18:15:49,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.52 | bwd_microstep: 252.91 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 252.08 | step_microstep: 2.11 [2025-11-06 18:15:49,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.26 | bwd: 253.65 | 
bwd_inner: 1.37 | bwd_allreduce: 252.13 | step: 2.20 36%|███▋ | 1279/3507 [31:03<41:50, 1.13s/it] {'loss': 0.1272, 'learning_rate': 1.4682114338067152e-05, 'epoch': 0.36} 36%|███▋ | 1279/3507 [31:03<41:50, 1.13s/it]tensor([[-4.4688, -3.2031, 0.6523, 2.5469, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7969, -0.5820, 3.0625, -1.2188, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:15:49,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.65 | bwd_microstep: 0.61 | bwd_inner_microstep: 0.53 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.5000, -4.5312, -2.8594, 0.3965, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-4.9062, -2.3750, 1.8906, -0.1416, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0938, -3.4062, -0.2188, 2.7031, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -3.7969, 0.0344, 1.9766, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5469, -1.5781, 2.2188, 1.8984, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6406, -2.9844, -1.7969, 2.2500, -0.8828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:15:51,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:15:51,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.09 | bwd_microstep: 2187.43 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 2186.62 | step_microstep: 2.06 [2025-11-06 18:15:51,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.77 | bwd: 2188.05 | bwd_inner: 1.27 | bwd_allreduce: 
2186.66 | step: 2.14 36%|███▋ | 1280/3507 [31:05<57:53, 1.56s/it] {'loss': 0.8026, 'learning_rate': 1.4673950145296691e-05, 'epoch': 0.36} 36%|███▋ | 1280/3507 [31:05<57:53, 1.56s/it]tensor([[-4.6250, -4.1250, -1.0469, 2.3281, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2812, -3.7656, 0.4824, 2.0469, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:15:52,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.44 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-1.2422, 1.5469, 2.9844, -1.5000, -1.6797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.1719, 0.6211, 4.0938, -2.5156, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9688, -2.9844, 0.9414, 0.7852, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2188, -4.4062, -0.3105, -0.0571, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1875, -3.9531, -1.2969, 2.6562, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.2812, -3.1875, 1.0312, 0.6094, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:15:52,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 18:15:52,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.67 | bwd_microstep: 10.34 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 9.09 | step_microstep: 2.08 [2025-11-06 18:15:52,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.14 | bwd: 11.40 | bwd_inner: 2.07 | bwd_allreduce: 9.12 | step: 2.18 37%|███▋ | 1281/3507 
37%|███▋ | 1281/3507 [31:06<45:00, 1.21s/it] {'loss': 0.6929, 'learning_rate': 1.4665781964385028e-05, 'epoch': 0.37}
tensor([[-5.0938, -3.4219, 0.1934, 0.4395, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.0781, 1.3828, 4.2500, -1.5625, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.2812, -2.5000, 2.2656, -0.1738, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:15:52,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.22 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.1250, -1.5000, 2.6562, 0.2324, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.6875, -4.6875, 0.3555, 1.0859, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4062, -2.9688, -0.2266, 3.1562, -1.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9375, -4.2188, -0.8203, 2.3281, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.1953, -0.3281, 1.5469, 2.9688, -0.2354]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:15:53,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.29 | optimizer_step: 0.38
[2025-11-06 18:15:53,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.92 | bwd_microstep: 1172.12 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 1171.07 | step_microstep: 2.96
[2025-11-06 18:15:53,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.15 | bwd: 1173.23 | bwd_inner: 1.92 | bwd_allreduce: 1171.13 | step: 3.05
37%|███▋ | 1282/3507 [31:07<49:16, 1.33s/it] {'loss': 0.5626, 'learning_rate': 1.4657609802301828e-05, 'epoch': 0.37}
tensor([[-2.8125, -0.7031, 2.0000, 0.1270, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.4062, -4.1250, -0.5625, 1.0469, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1875, -2.3906, 1.1328, 1.2734, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:15:54,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.35 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-3.3594, -0.7344, 1.5469, -1.8047, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.6406, -1.3281, 0.6250, 3.4375, -0.3613]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-3.0312, -1.1797, 1.6641, 0.7812, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4688, -2.4688, 1.8594, -1.7188, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[0.0874, 1.9844, 4.4062, 3.0469, 0.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:15:54,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.35 | optimizer_step: 0.29
[2025-11-06 18:15:54,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.70 | bwd_microstep: 55.84 | bwd_inner_microstep: 1.47 | bwd_allreduce_microstep: 54.20 | step_microstep: 2.84
[2025-11-06 18:15:54,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.07 | bwd: 56.94 | bwd_inner: 2.45 | bwd_allreduce: 54.27 | step: 2.95
37%|███▋ | 1283/3507 [31:08<39:21, 1.06s/it] {'loss': 1.071, 'learning_rate': 1.4649433666020147e-05, 'epoch': 0.37}
tensor([[-4.6875, -3.8438, -0.5195, 1.9609, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4844, -3.4062, -1.8828, 0.9961, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.8125, -4.4375, 0.8945, 0.3008, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:15:54,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 269.27 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.2500, -4.1562, -1.3672, 3.1562, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.0312, -4.7500, -0.1689, 2.3906, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.9062, -4.7500, -0.8867, 0.9414, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.3438, -3.2812, 0.1177, 2.1406, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.7969, -0.2637, 2.5312, -0.2480, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:15:58,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.18 | optimizer_step: 0.20
[2025-11-06 18:15:58,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.83 | bwd_microstep: 3028.35 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 3027.29 | step_microstep: 2.29
[2025-11-06 18:15:58,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 467.13 | bwd: 3029.39 | bwd_inner: 1.88 | bwd_allreduce: 3027.34 | step: 2.38
37%|███▋ | 1284/3507 [31:11<1:07:45, 1.83s/it] {'loss': 0.4505, 'learning_rate': 1.464125356251644e-05, 'epoch': 0.37}
tensor([[-3.7812, -1.7891, 1.6797, 1.2422, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4219, -2.5312, 0.7969, 3.5781, -1.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6562, -4.3125, -1.6797, 1.6406, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:15:58,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.31 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.3438, -2.7344, 0.6836, 1.1953, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.3906, -0.0581, 2.3906, -0.3633, -2.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -3.4531, -0.2031, 1.4453, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0625, -4.3750, -0.2539, 0.2217, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2500, -1.0078, 2.1406, -2.5625, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:15:58,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.15 | optimizer_step: 0.14
[2025-11-06 18:15:58,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.31 | bwd_microstep: 41.84 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 41.01 | step_microstep: 1.87
[2025-11-06 18:15:58,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.65 | bwd: 42.66 | bwd_inner: 1.50 | bwd_allreduce: 41.04 | step: 1.94
37%|███▋ | 1285/3507 [31:12<52:08, 1.41s/it] {'loss': 0.7181, 'learning_rate': 1.4633069498770544e-05, 'epoch': 0.37}
tensor([[-3.1094, -0.7969, 2.5469, 0.2598, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:15:58,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.88 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.0625, -2.5781, 1.1875, -0.9609, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.0000, 0.2988, 3.5625, -1.0547, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.1406, -3.0156, -0.6641, 3.1250, -1.3359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8438, -3.3125, 0.3516, 1.2656, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.0312, -3.7188, -0.1572, 1.3203, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9688, -3.4375, 0.3945, 1.4844, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8125, -3.0312, 0.9844, 1.4375, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:16:00,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:16:00,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.39 | bwd_microstep: 1781.72 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1780.56 | step_microstep: 2.13
[2025-11-06 18:16:00,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.30 | bwd: 1782.63 | bwd_inner: 1.91 | bwd_allreduce: 1780.59 | step: 2.20
37%|███▋ | 1286/3507 [31:14<1:00:26, 1.63s/it] {'loss': 0.6254, 'learning_rate': 1.4624881481765672e-05, 'epoch': 0.37}
tensor([[-2.9219, -3.1406, -1.8906, 1.8281, -1.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.8984, 0.5781, 2.1250, -1.1250, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:16:00,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.73 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.0938, -5.5000, -1.8125, 1.9062, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5312, -4.2188, -0.4570, 1.1172, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.8438e+00, 1.9073e-04, 2.8281e+00, -6.7969e-01, -2.6406e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-4.2500, -1.5469, 2.3125, -0.2344, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.5312, -2.3281, 1.0312, -0.0212, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.2188, -3.7031, -0.0322, 0.8906, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:16:01,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:16:01,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.69 | bwd_microstep: 44.28 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 43.02 | step_microstep: 1.47
[2025-11-06 18:16:01,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.44 | bwd: 45.26 | bwd_inner: 2.08 | bwd_allreduce: 43.05 | step: 1.55
37%|███▋ | 1287/3507 [31:14<46:31, 1.26s/it] {'loss': 0.6665, 'learning_rate': 1.4616689518488417e-05, 'epoch': 0.37}
tensor([[-5.3438, -3.4688, 1.0781, 1.5781, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:16:01,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.19 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.7500, -2.8906, 0.0337, 2.3438, -2.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.4375, -3.0625, 1.9141, 1.3281, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0625, -0.8789, 3.2812, -0.7656, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.7188, -4.4062, 0.0623, 2.5781, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3125, -3.2812, 1.1875, 1.3672, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0312, -1.1484, 2.0312, 1.9219, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3438, -4.8438, -1.5938, 2.0469, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:16:03,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.22 | optimizer_step: 0.31
[2025-11-06 18:16:03,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.16 | bwd_microstep: 2173.27 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2172.18 | step_microstep: 2.23
[2025-11-06 18:16:03,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.37 | bwd: 2174.22 | bwd_inner: 1.85 | bwd_allreduce: 2172.22 | step: 2.30
37%|███▋ | 1288/3507 [31:17<1:00:43, 1.64s/it] {'loss': 0.4451, 'learning_rate': 1.4608493615928725e-05, 'epoch': 0.37}
tensor([[-1.1562, 1.5469, 3.3750, -1.0234, -1.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6250, -2.7656, 1.0391, 0.8555, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:03,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 114.08 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.9844, -3.2344, -0.2773, 2.3906, -2.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0938, -3.1875, 0.6445, 0.5078, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5312, -1.2578, 2.2656, -2.3125, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5312, -3.4219, 0.3066, 2.5938, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.9062, 0.6094, 4.4375, -1.1562, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.7500, 1.5312, 3.6562, -1.8125, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:16:04,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:16:04,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.48 | bwd_microstep: 208.22 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 207.26 | step_microstep: 1.77
[2025-11-06 18:16:04,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 260.57 | bwd: 209.13 | bwd_inner: 1.70 | bwd_allreduce: 207.29 | step: 1.86
37%|███▋ | 1289/3507 [31:17<48:01, 1.30s/it] {'loss': 0.517, 'learning_rate': 1.4600293781079923e-05, 'epoch': 0.37}
tensor([[-5.7500, -4.7812, -1.0000, 1.7812, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6875, -2.2344, 1.0078, 2.1406, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:04,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.43 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.3125, -1.0469, 2.5938, -1.9531, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.2812, -2.8594, 1.8672, 1.1406, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.9062, -2.6562, 1.3516, 0.0118, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6250, -1.1172, 3.2344, -1.1406, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.7500, -1.9375, 1.8047, -1.0156, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2188, 0.4121, 3.6406, -2.1875, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:16:06,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.20 | optimizer_step: 0.22
[2025-11-06 18:16:06,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.29 | bwd_microstep: 1981.46 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 1980.53 | step_microstep: 2.64
[2025-11-06 18:16:06,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 442.76 | bwd: 1982.29 | bwd_inner: 1.55 | bwd_allreduce: 1980.59 | step: 2.74
37%|███▋ | 1290/3507 [31:20<1:00:58, 1.65s/it] {'loss': 0.1519, 'learning_rate': 1.4592090020938683e-05, 'epoch': 0.37}
tensor([[-3.0312, -3.0625, -1.1094, 2.7031, -1.2109]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0938, -3.6875, -0.7070, 2.9531, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:06,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.42 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-3.8594, -2.2969, 1.0391, 1.7266, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.1094, 0.2012, 3.7500, -0.9570, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.5938, -1.2031, 1.1016, 4.5000, -0.1182]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6719, 0.9648, 3.9375, -1.9453, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-3.2500, -0.4023, 3.2031, 0.2471, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5000, -2.4219, 0.8438, -0.0371, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:16:07,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:16:07,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.66 | bwd_microstep: 57.59 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 56.58 | step_microstep: 1.67
[2025-11-06 18:16:07,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.11 | bwd: 58.43 | bwd_inner: 1.66 | bwd_allreduce: 56.62 | step: 1.75
37%|███▋ | 1291/3507 [31:20<48:12, 1.31s/it] {'loss': 0.6221, 'learning_rate': 1.4583882342505025e-05, 'epoch': 0.37}
tensor([[-3.7812, -3.1875, -0.9180, 1.4688, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.6875, -4.7188, 0.0342, 0.4180, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5938, -4.7812, -2.4531, 2.3906, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:07,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.48 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.2344, -1.3828, 1.7812, 0.9492, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.0000, -2.2188, 2.0938, -0.3203, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.1250, -0.5742, 3.1406, 1.1641, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.9688, -2.8906, 0.0554, -1.5938, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.9531, -1.1328, 2.7344, -0.4844, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:16:10,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:16:10,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.65 | bwd_microstep: 2688.93 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 2687.79 | step_microstep: 1.89
[2025-11-06 18:16:10,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 430.16 | bwd: 2689.87 | bwd_inner: 1.91 | bwd_allreduce: 2687.83 | step: 1.97
37%|███▋ | 1292/3507 [31:24<1:08:43, 1.86s/it] {'loss': 0.2041, 'learning_rate': 1.4575670752782314e-05, 'epoch': 0.37}
tensor([[-5.3438, -3.8906, -0.0525, 1.0469, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2656, -2.2656, 0.9688, 3.3906, -1.6484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:10,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.56 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-4.4688, -4.3750, -1.6641, 2.5469, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.5938, -1.7422, 1.8516, 1.5938, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.9375, -4.3438, -0.7930, 2.7812, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.9844, 0.1113, 2.7969, -1.6172, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4531, -1.5078, 1.2500, 0.0374, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0312, -3.2812, 0.0986, 3.1094, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:16:11,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:16:11,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.74 | bwd_microstep: 529.64 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 528.69 | step_microstep: 1.69
[2025-11-06 18:16:11,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.32 | bwd: 530.75 | bwd_inner: 1.86 | bwd_allreduce: 528.74 | step: 1.78
37%|███▋ | 1293/3507 [31:24<58:02, 1.57s/it] {'loss': 1.0065, 'learning_rate': 1.4567455258777255e-05, 'epoch': 0.37}
tensor([[-2.9844, -3.2188, -1.6406, 2.4531, -1.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-8.6250, -5.5625, -0.7734, -3.9688, -7.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.3125, -2.5312, 1.1484, 1.5469, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:16:11,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.22 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-5.4375, -2.2031, 2.5000, -1.0156, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.9375, -4.1562, 0.0991, 0.8359, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.1094, 1.0781, 3.4062, -1.2656, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.6562, -3.5469, 1.0781, 0.8555, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7031, -1.2734, 2.0625, -0.1099, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:16:13,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.27
[2025-11-06 18:16:13,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.04 | bwd_microstep: 1753.70 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1752.50 | step_microstep: 1.99
[2025-11-06 18:16:13,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.28 | bwd: 1754.71 | bwd_inner: 2.00 | bwd_allreduce: 1752.56 | step: 2.09
37%|███▋ | 1294/3507 [31:27<1:03:57, 1.73s/it] {'loss': 0.3182, 'learning_rate': 1.4559235867499874e-05, 'epoch': 0.37}
tensor([[-2.1094, -0.1328, 2.4219, 1.0703, -1.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:16:13,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.10 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-1.9219, 0.7305, 2.7656, -0.9453, -2.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-4.5938, -4.3438, -1.5469, 2.4688, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.1875, -1.8828, 2.0156, 0.7500, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4375, -2.6250, 2.1719, -0.0160, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.2500, -4.1562, -0.2139, 2.4531, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0000, -4.2500, 0.5820, 2.1250, -3.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7812, -1.0391, 2.0156, -1.3594, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
[2025-11-06 18:16:13,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:16:13,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.06 | bwd_microstep: 106.49 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 105.43 | step_microstep: 2.21
[2025-11-06 18:16:13,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.18 | bwd: 107.36 | bwd_inner: 1.76 | bwd_allreduce: 105.47 | step: 2.29
37%|███▋ | 1295/3507 [31:27<50:09, 1.36s/it] {'loss': 0.9487, 'learning_rate': 1.4551012585963542e-05, 'epoch': 0.37}
tensor([[-4.0000, -3.8281, -1.2969, 2.3906, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3438, -4.2500, -0.3223, 2.2812, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0312, -4.7188, -2.1562, 1.2578, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-3.9688, -0.4590, 3.5469, -1.1250, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:16:13,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 269.54 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.2031, 0.0791, 2.2031, -0.6133, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.5625, -4.5000, -0.6758, 1.6875, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2812, -1.8047, 2.1875, 0.6953, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.4375, -3.3594, -0.0757, 1.7891, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:16:15,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.21 | optimizer_step: 0.24
[2025-11-06 18:16:15,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.36 | bwd_microstep: 1622.31 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1621.13 | step_microstep: 2.30
[2025-11-06 18:16:15,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 410.92 | bwd: 1623.31 | bwd_inner: 1.97 | bwd_allreduce: 1621.19 | step: 2.39
37%|███▋ | 1296/3507 [31:29<58:03, 1.58s/it] {'loss': 1.0565, 'learning_rate': 1.4542785421184932e-05, 'epoch': 0.37}
tensor([[-2.7812, -3.1562, -1.8438, 2.2031, -0.9609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:15,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.41 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.2188, -3.3906, -1.8203, 2.1562, -1.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-3.6719, -1.0859, 1.7266, -0.9922, -3.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0938, -5.1250, -2.2656, 2.5781, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.1875, -3.0938, 1.2969, 1.2500, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.0312, -4.9375, -1.0859, 1.2578, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.0156, 0.4961, 3.1562, 0.5859, -1.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.8750, -3.9844, -0.6562, -4.4688, -6.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:16:16,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:16:16,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.02 | bwd_microstep: 194.01 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 192.85 | step_microstep: 1.60
[2025-11-06 18:16:16,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 274.43 | bwd: 194.79 | bwd_inner: 1.75 | bwd_allreduce: 192.89 | step: 1.68
37%|███▋ | 1297/3507 [31:30<46:09, 1.25s/it] {'loss': 0.5895, 'learning_rate': 1.4534554380184039e-05, 'epoch': 0.37}
tensor([[-4.2812, -3.9844, -1.2656, 2.6719, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.8438, -2.4688, 1.8438, 0.9961, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:16,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.70 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.4688, -4.4375, -2.1875, 1.8281, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4375, -4.8750, -1.4453, 1.9766, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.4688, -3.3906, 0.7305, 0.4121, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.1875, -2.4844, 0.0349, 2.2031, -1.7266]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2188, -3.5781, -2.2656, 1.7734, -1.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[5.2500, 5.7500, 6.2188, 8.0625, 5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:16:17,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 18:16:17,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.64 | bwd_microstep: 1439.19 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1438.10 | step_microstep: 1.89
[2025-11-06 18:16:17,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 247.37 | bwd: 1440.14 | bwd_inner: 1.88 | bwd_allreduce: 1438.13 | step: 1.96
37%|███▋ | 1298/3507 [31:31<51:16, 1.39s/it] {'loss': 0.3009, 'learning_rate': 1.4526319469984158e-05, 'epoch': 0.37}
tensor([[-3.2656, -0.5625, 2.4688, -0.5820, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.2656e+00, -2.6562e+00, 1.9073e-03, 2.8750e+00, -1.5625e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7812, -3.9219, -0.3496, 2.8281, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.5000, -5.3438, -0.8906, -1.1719, -5.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:16:18,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.15 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.5312, -1.1797, 3.0312, -1.1641, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4062, -2.5781, 1.0938, 1.2734, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2188, -2.6406, 1.1094, 2.4219, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.2188, -3.4688, -0.3750, 2.5469, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
[2025-11-06 18:16:18,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 18:16:18,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.65 | bwd_microstep: 14.94 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 13.92 | step_microstep: 1.92
[2025-11-06 18:16:18,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.82 | bwd: 15.94 | bwd_inner: 1.85 | bwd_allreduce: 13.96 | step: 2.00
37%|███▋ | 1299/3507 [31:32<40:06, 1.09s/it] {'loss': 0.9938, 'learning_rate': 1.4518080697611896e-05, 'epoch': 0.37}
tensor([[-4.8750, -2.7500, 1.0469, 0.1904, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2500, -3.3906, -0.2578, 2.0938, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1875, -2.9844, 0.4238, 2.1875, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.6406, -2.3594, 0.6953, 1.7578, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:16:18,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.91 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.4375, -0.9570, 2.9531, -1.6875, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.0312, -4.2188, 0.0830, -2.2812, -5.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4062, -3.1406, 0.0708, 1.0312, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-8.6875, -7.0000, -1.5469, 0.5156, -6.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:16:21,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:16:21,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.95 | bwd_microstep: 3051.55 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 3050.46 | step_microstep: 1.80
[2025-11-06 18:16:21,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 401.89 | bwd: 3052.50 | bwd_inner: 1.87 | bwd_allreduce: 3050.50 | step: 1.88
37%|███▋ | 1300/3507 [31:35<1:06:36, 1.81s/it] {'loss': 0.1985, 'learning_rate': 1.4509838070097147e-05, 'epoch': 0.37}
tensor([[-3.7031, -3.3906, -0.6250, 3.2656, -1.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:22,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.12 | bwd_microstep: 1.24 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11
tensor([[-4.2188, -2.5625, 1.2500, 1.7891, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.0000, -3.4219, 0.0806, -2.2031, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9062, -3.1719, 0.7227, 1.1641, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6719, -0.8281, 2.0625, 1.6094, -1.8359]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.5781, -3.7969, -1.8125, 2.6719, -1.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6875, 0.8516, 3.9844, -1.6172, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6875, -1.2031, 3.1406, -1.6641, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:16:22,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.15 | optimizer_step: 0.19
[2025-11-06 18:16:22,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.22 | bwd_microstep: 177.90 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 176.67 | step_microstep: 1.85
[2025-11-06 18:16:22,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.35 | bwd: 179.13 | bwd_inner: 2.27 | bwd_allreduce: 176.73 | step: 1.95
37%|███▋ | 1301/3507 [31:36<52:05, 1.42s/it] {'loss': 0.3466, 'learning_rate': 1.4501591594473098e-05, 'epoch': 0.37}
tensor([[-5.2812, -2.4531, 2.0000, -0.2090, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8594, -0.8555, 2.2969, -1.5547, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:16:22,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.41 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 |
step_microstep: 0.09 tensor([[-0.7734, 0.6719, 3.1875, 3.9375, 0.0183]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6250, -4.1562, -1.3047, 1.9297, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8594, -2.7500, 0.4160, 1.9609, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0312, -3.7500, -1.0469, 2.6250, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9375, -3.4531, 1.3516, 0.4180, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5000, -2.7812, 1.1016, 1.9922, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:16:25,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.21 | optimizer_step: 0.24 [2025-11-06 18:16:25,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.63 | bwd_microstep: 2271.94 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2270.85 | step_microstep: 2.32 [2025-11-06 18:16:25,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 468.07 | bwd: 2272.94 | bwd_inner: 1.86 | bwd_allreduce: 2270.90 | step: 2.41 37%|███▋ | 1302/3507 [31:38<1:07:11, 1.83s/it] {'loss': 0.412, 'learning_rate': 1.4493341277776218e-05, 'epoch': 0.37} 37%|███▋ | 1302/3507 [31:38<1:07:11, 1.83s/it]tensor([[-2.5469, -2.8438, -1.7109, 2.0938, -0.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0000, -5.0938, -1.4219, 1.3281, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:16:25,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.46 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 
tensor([[-4.3125, -4.2500, -1.6328, 2.6250, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6250, -3.8594, -2.0781, 2.0312, -1.6641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0000, -3.8438, -0.4219, 1.4766, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1562, -3.7344, 0.0119, 1.5234, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0312, -3.3125, 0.7773, 1.7734, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.4375, -4.2188, -0.5078, -1.7969, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:16:25,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:16:25,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.68 | bwd_microstep: 176.86 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 175.79 | step_microstep: 1.97 [2025-11-06 18:16:25,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.18 | bwd: 177.75 | bwd_inner: 1.78 | bwd_allreduce: 175.83 | step: 2.05 37%|███▋ | 1303/3507 [31:39<53:16, 1.45s/it] {'loss': 0.3854, 'learning_rate': 1.4485087127046256e-05, 'epoch': 0.37} 37%|███▋ | 1303/3507 [31:39<53:16, 1.45s/it]tensor([[-3.7969, -3.5781, -0.9453, 3.0312, -1.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:16:25,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.19 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.07 tensor([[-3.8125, -3.2031, -0.2676, 2.8438, -1.9453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7344, -2.1719, 0.3301, 
3.2188, -1.1484]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4531, -2.4219, -0.1484, 4.1875, -0.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5938, -3.2344, 0.4863, 2.3594, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -0.6406, 3.5938, -1.3594, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4688, -4.0625, 0.3184, 2.4062, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2812, -1.9609, 1.4297, -0.7383, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:16:27,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:16:27,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 229.11 | bwd_microstep: 1283.92 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 1282.76 | step_microstep: 2.35 [2025-11-06 18:16:27,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 410.33 | bwd: 1284.80 | bwd_inner: 1.85 | bwd_allreduce: 1282.80 | step: 2.42 37%|███▋ | 1304/3507 [31:41<56:23, 1.54s/it] {'loss': 0.3128, 'learning_rate': 1.447682914932623e-05, 'epoch': 0.37} 37%|███▋ | 1304/3507 [31:41<56:23, 1.54s/it]tensor([[-3.0625, -2.5781, 0.1689, 3.5156, -1.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.5625, 1.3750, 3.1094, -1.6016, -1.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.8594, -0.4473, 2.2812, -0.1377, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:16:27,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.27 | bwd_microstep: 0.90 | bwd_inner_microstep: 
0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.9219, 0.4160, 3.1875, -1.9219, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.5938, -3.1719, 0.2578, 1.2188, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.4688, -4.6562, 0.1816, 1.4219, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4375, -5.0312, -0.6367, 1.1797, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5156, -3.4531, -1.4375, 2.2188, -1.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:16:28,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:16:28,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.79 | bwd_microstep: 313.24 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 312.15 | step_microstep: 2.15 [2025-11-06 18:16:28,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.08 | bwd: 314.15 | bwd_inner: 1.83 | bwd_allreduce: 312.19 | step: 2.22 37%|███▋ | 1305/3507 [31:41<47:09, 1.29s/it] {'loss': 0.7328, 'learning_rate': 1.4468567351662423e-05, 'epoch': 0.37} 37%|███▋ | 1305/3507 [31:41<47:09, 1.29s/it]tensor([[-4.5312, -4.4688, -1.9922, 2.2969, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:16:28,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.79 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.2344, -2.0625, 1.0781, 2.8750, -1.7578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.6875, -5.9688, -2.2031, 1.2031, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') tensor([[-2.6250, 0.3633, 3.0781, -1.1562, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5000, -2.3906, 1.0469, 0.5000, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.8984, 1.0469, 2.7031, -1.7422, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.9062, -1.8750, 1.3438, 0.4688, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9219, -1.3750, 2.2656, 0.5938, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:16:30,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 18:16:30,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.80 | bwd_microstep: 1814.71 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 1813.78 | step_microstep: 2.05 [2025-11-06 18:16:30,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.63 | bwd: 1815.53 | bwd_inner: 1.57 | bwd_allreduce: 1813.82 | step: 2.13 37%|███▋ | 1306/3507 [31:44<57:24, 1.57s/it] {'loss': 0.7401, 'learning_rate': 1.4460301741104381e-05, 'epoch': 0.37} 37%|███▋ | 1306/3507 [31:44<57:24, 1.57s/it]tensor([[-5.9062, -2.7969, 2.2500, -0.2832, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:16:30,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.78 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.4688, 0.9844, 4.1250, -1.2188, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4375, -3.6719, 0.9375, 2.0781, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') 
tensor([[-4.8438, -1.7969, 2.2500, -1.4609, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4062, -3.0625, 0.2949, 1.6797, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.1562, -1.9922, 0.8828, 2.4375, -1.7422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2031, 0.1514, 3.2500, -1.7422, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -2.0625, 2.0781, -0.3125, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:16:30,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:16:30,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.68 | bwd_microstep: 190.35 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 189.36 | step_microstep: 1.95 [2025-11-06 18:16:30,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.47 | bwd: 191.12 | bwd_inner: 1.58 | bwd_allreduce: 189.39 | step: 2.04 37%|███▋ | 1307/3507 [31:44<46:22, 1.26s/it] {'loss': 0.1345, 'learning_rate': 1.44520323247049e-05, 'epoch': 0.37} 37%|███▋ | 1307/3507 [31:44<46:22, 1.26s/it]tensor([[-4.8750, -2.0938, 1.7266, -0.9648, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.8281, 0.8359, 2.5938, -1.3828, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1250, -3.1875, 0.9805, 1.1641, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6875, -2.3906, 1.0078, 2.5312, -2.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3438, -2.7031, 1.8047, -0.1816, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-4.6875, -3.6875, -0.0928, 2.5000, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:16:31,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.30 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0312, -3.1250, 0.4590, 0.1074, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6250, -2.9219, 1.0391, 1.9375, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:16:32,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.22 | optimizer_step: 0.28 [2025-11-06 18:16:32,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.22 | bwd_microstep: 1036.98 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 1036.03 | step_microstep: 2.53 [2025-11-06 18:16:32,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.54 | bwd: 1037.95 | bwd_inner: 1.71 | bwd_allreduce: 1036.07 | step: 2.61 37%|███▋ | 1308/3507 [31:46<52:43, 1.44s/it] {'loss': 0.4709, 'learning_rate': 1.4443759109520023e-05, 'epoch': 0.37} 37%|███▋ | 1308/3507 [31:46<52:43, 1.44s/it]tensor([[-4.2812, -1.1797, 2.5625, -1.2266, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5938, -2.2188, 0.8594, 1.7422, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6562, -2.5312, 0.9531, 3.2812, -1.9297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:16:32,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.60 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.9375, -2.0469, 2.0469, -0.6836, -4.0938]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.4062, -5.6250, -1.2266, -0.3984, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8906, -1.4453, 1.9922, 0.1035, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2812, -3.7812, -0.8281, 2.6875, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.6875, -4.9688, -0.1816, 1.0312, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:16:33,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:16:33,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.59 | bwd_microstep: 451.54 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 450.58 | step_microstep: 1.71 [2025-11-06 18:16:33,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.23 | bwd: 452.45 | bwd_inner: 1.69 | bwd_allreduce: 450.61 | step: 1.78 37%|███▋ | 1309/3507 [31:47<46:40, 1.27s/it] {'loss': 0.731, 'learning_rate': 1.4435482102609038e-05, 'epoch': 0.37} 37%|███▋ | 1309/3507 [31:47<46:40, 1.27s/it]tensor([[-2.7344, 0.3359, 3.0469, -1.4922, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.0938, -1.7109, 2.8125, -1.0078, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9375, -4.0000, -1.9922, 1.9141, -1.8828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:16:34,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.07 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.2656, 0.6641, 2.8438, -1.6328, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([1], device='cuda:3') tensor([[-3.3750, -3.2656, -0.9180, 2.9531, -1.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2812, -3.7344, -1.1094, 1.4609, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7188, -2.6719, 1.1719, 0.9570, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3750, -2.2188, 1.8672, -1.9844, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:16:35,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.19 [2025-11-06 18:16:35,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.79 | bwd_microstep: 1252.80 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1251.58 | step_microstep: 1.90 [2025-11-06 18:16:35,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.88 | bwd: 1253.77 | bwd_inner: 2.00 | bwd_allreduce: 1251.63 | step: 1.99 37%|███▋ | 1310/3507 [31:49<52:48, 1.44s/it] {'loss': 0.7736, 'learning_rate': 1.4427201311034467e-05, 'epoch': 0.37} 37%|███▋ | 1310/3507 [31:49<52:48, 1.44s/it]tensor([[-5.0938, -3.9844, -0.6250, 1.1172, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4062, -1.4609, 2.4062, -0.7148, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:16:35,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.84 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.06 tensor([[-3.4531, -3.2188, -0.3594, 3.7188, -1.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5938, -2.2812, 1.3203, 3.0156, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-5.2812, -3.7188, 0.4414, 1.8281, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.3125, -3.4219, -0.0211, 2.7188, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9844, -2.0469, 0.8086, 3.0312, -1.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1562, -3.8906, -0.4492, 1.0703, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:16:36,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:16:36,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.73 | bwd_microstep: 325.86 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 324.73 | step_microstep: 1.47 [2025-11-06 18:16:36,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.57 | bwd: 326.53 | bwd_inner: 1.65 | bwd_allreduce: 324.76 | step: 1.54 37%|███▋ | 1311/3507 [31:50<44:24, 1.21s/it] {'loss': 0.3091, 'learning_rate': 1.4418916741862057e-05, 'epoch': 0.37} 37%|███▋ | 1311/3507 [31:50<44:24, 1.21s/it]tensor([[-2.2969, -0.4688, 1.9609, 0.5977, -1.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:16:36,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.17 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.7812, -3.2031, 0.5000, 1.4922, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9219, 0.6445, 2.4062, -0.8789, -1.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1250, -0.7148, 3.0938, -1.2812, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.7188, -3.1719, 0.9883, 2.4844, 
-2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -3.6406, 0.1807, 1.7266, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6250, -1.1953, 2.3125, -2.7188, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8438, -2.9531, 2.1094, 0.2393, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:16:37,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 18:16:37,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 104.48 | bwd_microstep: 1284.55 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 1283.60 | step_microstep: 1.82 [2025-11-06 18:16:37,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 250.67 | bwd: 1285.34 | bwd_inner: 1.57 | bwd_allreduce: 1283.64 | step: 1.89 37%|███▋ | 1312/3507 [31:51<48:14, 1.32s/it] {'loss': 0.1877, 'learning_rate': 1.4410628402160785e-05, 'epoch': 0.37} 37%|███▋ | 1312/3507 [31:51<48:14, 1.32s/it]tensor([[-5.5938, -4.8438, -1.5547, 1.2188, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1562, -0.7266, 2.6094, 0.9062, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:16:37,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.97 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.3750, -4.1250, -1.1797, 3.0000, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5156, -0.2637, 3.2031, -1.1641, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6250, -3.0938, 1.7578, 0.6172, -4.2188]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.9375, -3.0312, 0.6602, 0.2139, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2188, -2.8281, 0.7031, 2.2969, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5625, 0.1035, 2.1875, -1.6875, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:16:39,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.13 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 18:16:39,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.74 | bwd_microstep: 1009.71 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 1008.91 | step_microstep: 3.02 [2025-11-06 18:16:39,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.73 | bwd: 1010.81 | bwd_inner: 1.70 | bwd_allreduce: 1008.97 | step: 3.11 37%|███▋ | 1313/3507 [31:52<48:44, 1.33s/it] {'loss': 0.1872, 'learning_rate': 1.4402336299002842e-05, 'epoch': 0.37} 37%|███▋ | 1313/3507 [31:52<48:44, 1.33s/it]tensor([[-4.5000, -2.1406, 1.6406, 0.2930, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4531, -1.7188, 0.6875, 2.8125, -1.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6406, -3.6875, -1.5000, 2.6250, -1.6172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8906, -1.3203, 2.1562, 0.1396, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:16:39,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.14 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.0000, -0.9961, 2.0469, -1.6875, -3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:2') tensor([[-6.6250, -5.0312, -0.2949, 1.3750, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6875, -3.7500, 0.1797, 0.0889, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.9375, 0.2490, 3.8438, -0.2148, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:16:40,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:16:40,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.85 | bwd_microstep: 471.25 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 470.16 | step_microstep: 1.54 [2025-11-06 18:16:40,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 447.01 | bwd: 472.36 | bwd_inner: 1.97 | bwd_allreduce: 470.22 | step: 1.64 37%|███▋ | 1314/3507 [31:53<44:42, 1.22s/it] {'loss': 0.403, 'learning_rate': 1.4394040439463628e-05, 'epoch': 0.37} 37%|███▋ | 1314/3507 [31:53<44:42, 1.22s/it]tensor([[-0.8750, 2.1094, 3.4688, -1.5078, -1.5234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:16:40,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 94.19 | bwd_microstep: 1.30 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14 tensor([[-4.1875, -3.7031, -0.8633, 2.1875, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9531, -0.6719, 3.3438, -0.8398, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, -2.2031, 1.3359, 1.2734, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5938, -4.4688, -1.5547, 2.6094, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-2.2344, -2.5625, -0.7812, 3.9062, -0.3516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6719, -3.5625, -1.2891, 2.6250, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8438, -4.2188, -0.5664, -0.0884, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:16:42,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.27 [2025-11-06 18:16:42,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.53 | bwd_microstep: 2580.76 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 2579.81 | step_microstep: 1.96 [2025-11-06 18:16:42,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.74 | bwd: 2582.05 | bwd_inner: 1.95 | bwd_allreduce: 2579.88 | step: 2.10 37%|███▋ | 1315/3507 [31:56<1:03:16, 1.73s/it] {'loss': 0.6437, 'learning_rate': 1.4385740830621755e-05, 'epoch': 0.37} 37%|███▋ | 1315/3507 [31:56<1:03:16, 1.73s/it]tensor([[-4.6562e+00, -3.6562e+00, 6.9427e-04, 2.6875e+00, -2.6250e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:16:43,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.00 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.3125, -2.3438, 1.4141, 1.1172, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9062, -3.6875, -0.1025, 1.6484, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5312, -3.5938, -0.3535, 2.0000, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9844, 0.1167, 2.9844, -1.2266, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') 
tensor([[-1.7031, 0.5859, 3.8438, 2.1562, -1.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8438, -3.5312, -0.7734, 2.9219, -1.8516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3438, -2.9062, 0.5430, 1.4297, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:16:43,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:16:43,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.12 | bwd_microstep: 88.68 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 87.82 | step_microstep: 1.62
[2025-11-06 18:16:43,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.14 | bwd: 89.39 | bwd_inner: 1.40 | bwd_allreduce: 87.86 | step: 1.70
38%|███▊ | 1316/3507 [31:57<49:05, 1.34s/it] {'loss': 0.2014, 'learning_rate': 1.4377437479559021e-05, 'epoch': 0.38}

tensor([[-6.2812, -5.4375, -1.4531, 1.6797, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -2.6875, 1.3047, 1.2266, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:43,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.57 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.6875, -1.7266, 1.6953, 1.0312, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.3516, 2.1094, 4.6562, -0.6836, -1.8516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.0312, -1.5234, 1.9375, -0.0791, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5938, -2.8594, 0.9180, 1.4766, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.9297, 2.3438, 5.3750, 0.8750, -1.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.5234, 1.1328, 2.2812, -1.7891, -1.8516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:16:45,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.23
[2025-11-06 18:16:45,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.12 | bwd_microstep: 1581.55 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 1580.68 | step_microstep: 2.19
[2025-11-06 18:16:45,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.72 | bwd: 1582.24 | bwd_inner: 1.37 | bwd_allreduce: 1580.73 | step: 2.27
38%|███▊ | 1317/3507 [31:59<55:48, 1.53s/it] {'loss': 0.4357, 'learning_rate': 1.4369130393360437e-05, 'epoch': 0.38}

tensor([[-3.7188, -0.7695, 2.5781, -0.9688, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.4062, -2.1562, 1.7344, 1.0156, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5000, -1.0703, 1.2578, -1.4297, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.7188, -2.4375, 1.4922, 0.2676, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:16:45,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 221.09 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12
tensor([[-4.5625e+00, -3.4375e+00, -3.1738e-03, 2.0156e+00, -2.7812e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6250, -2.6250, 0.8047, 0.0859, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.7812, -3.2656, 1.5000, 0.3027, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5312, -3.5781, -0.2734, 2.0938, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:45,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.21 | optimizer_step: 0.21
[2025-11-06 18:16:45,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.73 | bwd_microstep: 1.86 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.82 | step_microstep: 1.95
[2025-11-06 18:16:45,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.84 | bwd: 2.92 | bwd_inner: 1.87 | bwd_allreduce: 0.87 | step: 2.07
38%|███▊ | 1318/3507 [31:59<43:45, 1.20s/it] {'loss': 0.5767, 'learning_rate': 1.4360819579114185e-05, 'epoch': 0.38}

tensor([[-4.9688, -3.7188, 0.0947, 2.0312, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:46,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.82 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.8125, -3.6719, -0.2383, 1.6953, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6875, -2.4375, -0.3125, 2.8125, -1.0859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.5156, 1.1094, 2.7031, -1.3438, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.2188, -0.1572, 2.7500, -1.7891, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6250, -3.1094, 0.9609, 2.5312, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.8281, -1.4297, 1.9297, 3.2969, -1.4609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2500, -2.1719, 1.9219, 1.8125, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:16:49,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:16:49,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 110.79 | bwd_microstep: 3830.75 | bwd_inner_microstep: 7.30 | bwd_allreduce_microstep: 3823.35 | step_microstep: 2.52
[2025-11-06 18:16:49,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.63 | bwd: 3831.46 | bwd_inner: 7.91 | bwd_allreduce: 3823.40 | step: 2.61
38%|███▊ | 1319/3507 [32:03<1:16:03, 2.09s/it] {'loss': 0.4168, 'learning_rate': 1.4352505043911634e-05, 'epoch': 0.38}

tensor([[-4.9375, -3.5781, 0.2637, 1.7500, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:50,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.98 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.0938, 0.7852, 3.5469, -0.1050, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.7969, -1.7109, 1.3516, 3.1719, -1.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.8438, -2.2031, -1.1484, 2.5625, -0.2432]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.3672, 1.5625, 3.3906, -0.8828, -1.6953]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.0938, -2.1406, 2.4531, -0.1050, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.8438, -5.4375, -2.1406, 1.7188, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.1875, -0.9531, 3.3125, -0.6289, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:16:50,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.24 | optimizer_step: 0.26
[2025-11-06 18:16:50,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.42 | bwd_microstep: 144.22 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 143.09 | step_microstep: 2.79
[2025-11-06 18:16:50,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.43 | bwd: 145.04 | bwd_inner: 1.72 | bwd_allreduce: 143.15 | step: 2.90
38%|███▊ | 1320/3507 [32:04<58:34, 1.61s/it] {'loss': 0.3405, 'learning_rate': 1.4344186794847326e-05, 'epoch': 0.38}

tensor([[-7.7500, -5.9375, -1.5078, -1.1172, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.0312, -2.3438, 1.5312, 1.8828, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:50,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.98 | bwd_microstep: 1.62 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.23
tensor([[-5.4688, -3.1406, 1.5625, 1.1250, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.0000, -4.7812, -0.7266, 1.5312, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6094, -3.3438, -1.1328, 2.0156, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.1250, -3.4062, -2.1406, 1.4453, -1.3047]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-6.2812, -4.1562, 0.5469, 0.6211, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[6.2188, 7.3125, 7.3125, 6.9688, 5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:16:52,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:16:52,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.74 | bwd_microstep: 1402.81 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1401.66 | step_microstep: 1.70
[2025-11-06 18:16:52,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.74 | bwd: 1404.43 | bwd_inner: 2.33 | bwd_allreduce: 1401.79 | step: 1.94
38%|███▊ | 1321/3507 [32:06<1:01:15, 1.68s/it] {'loss': 0.8633, 'learning_rate': 1.433586483901897e-05, 'epoch': 0.38}

tensor([[-1.4844, 1.5156, 3.1719, -1.6641, -1.9609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5312, -4.2812, -1.5156, 2.0781, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9688, -4.0312, -0.3027, 2.5469, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:52,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.08 | bwd_microstep: 1.24 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-4.3125, -2.7188, 0.7734, 1.3672, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.3438, -0.9609, 2.0938, 3.2344, -1.1641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5312, -1.9688, 1.0469, 1.4844, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0312, -3.0781, 0.0378, 2.2500, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.8438, -4.5938, 0.4160, 0.3379, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:16:52,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.14 | optimizer_step: 0.19
[2025-11-06 18:16:52,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.39 | bwd_microstep: 2.21 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 0.80 | step_microstep: 1.66
[2025-11-06 18:16:52,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.50 | bwd: 3.44 | bwd_inner: 2.47 | bwd_allreduce: 0.84 | step: 1.76
38%|███▊ | 1322/3507 [32:06<47:40, 1.31s/it] {'loss': 0.3709, 'learning_rate': 1.4327539183527447e-05, 'epoch': 0.38}

tensor([[-5.4062, -3.4219, 0.9844, 1.3047, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2656, -3.3281, -1.4531, 2.5000, -1.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.3750, -5.2812, -1.1875, 1.2734, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:52,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.93 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.1562, -1.2500, 1.6094, -2.2812, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-6.1875, -4.8750, -0.7812, 1.0625, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0938, -3.3438, -2.0000, 1.5000, -1.3359]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.7031, 0.6719, 4.1875, -0.4277, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6094, -0.4805, 2.7031, -1.2344, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:16:55,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.16 | optimizer_step: 0.19
[2025-11-06 18:16:55,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.22 | bwd_microstep: 2535.00 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2533.92 | step_microstep: 2.33
[2025-11-06 18:16:55,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.18 | bwd: 2535.88 | bwd_inner: 1.77 | bwd_allreduce: 2533.97 | step: 2.41
38%|███▊ | 1323/3507 [32:09<1:05:25, 1.80s/it] {'loss': 0.8968, 'learning_rate': 1.4319209835476783e-05, 'epoch': 0.38}

tensor([[-5.1875, -4.5625, -1.3203, 1.4375, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:55,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 99.01 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.0312, 0.9336, 3.6562, -0.3047, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6094, -3.7656, -2.1094, 1.6016, -1.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6250, -3.5000, 0.6445, 0.2969, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5000, -3.7188, -1.4844, 3.1094, -1.3672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1562, -3.8750, -1.0000, 2.8125, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-10.8750, -9.0625, -4.0625, -2.9688, -8.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.3535, 2.1562, 3.9688, 1.0391, -0.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:16:56,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:16:56,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.46 | bwd_microstep: 73.90 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 72.59 | step_microstep: 1.80
[2025-11-06 18:16:56,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.48 | bwd: 74.68 | bwd_inner: 1.92 | bwd_allreduce: 72.62 | step: 1.87
38%|███▊ | 1324/3507 [32:09<50:33, 1.39s/it] {'loss': 0.2119, 'learning_rate': 1.4310876801974165e-05, 'epoch': 0.38}

tensor([[-3.5000, -3.4062, -0.9492, 3.0000, -1.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6562, -3.9219, -2.3125, 1.7188, -1.6484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:16:56,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.23 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.6719, -3.3438, -0.9219, 2.2500, -1.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7969, -3.0156, -0.2520, 1.9062, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.7500, -0.0109, 2.5938, -1.1484, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.0938, -2.9688, -0.5312, 3.6094, -1.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.0312, -1.0625, 1.5938, 3.4219, -0.7695]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.9141, 1.9062, 2.7500, -1.8750, -1.5078]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:16:57,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.20 | optimizer_step: 0.24
[2025-11-06 18:16:57,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.58 | bwd_microstep: 811.48 | bwd_inner_microstep: 2.73 | bwd_allreduce_microstep: 808.64 | step_microstep: 2.61
[2025-11-06 18:16:57,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.84 | bwd: 812.24 | bwd_inner: 3.38 | bwd_allreduce: 808.69 | step: 2.71
38%|███▊ | 1325/3507 [32:11<48:23, 1.33s/it] {'loss': 0.6487, 'learning_rate': 1.4302540090129916e-05, 'epoch': 0.38}

tensor([[-3.7500, -0.3457, 3.3125, -1.4688, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1250, -3.3750, 0.1963, 0.2129, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3438, -1.0078, 2.0469, -2.8594, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-3.6562, -3.1406, 0.0708, 3.6875, -1.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8594, -0.7891, 2.9531, -1.0859, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6875, 0.0483, 2.8438, 0.0154, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.3750, -1.6953, 1.8984, -0.5156, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:58,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.42 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.5781, -1.0547, 1.4297, 4.5938, -0.1128]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:16:58,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:16:58,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.92 | bwd_microstep: 1.91 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.78 | step_microstep: 2.60
[2025-11-06 18:16:58,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 429.36 | bwd: 2.64 | bwd_inner: 1.67 | bwd_allreduce: 0.82 | step: 2.68
38%|███▊ | 1326/3507 [32:12<44:55, 1.24s/it] {'loss': 0.83, 'learning_rate': 1.4294199707057505e-05, 'epoch': 0.38}

tensor([[-5.7188, -3.6562, 1.0391, 1.3203, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7188, -1.3516, 2.2344, 0.7344, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.7500, -1.6406, 0.3184, 4.2188, -0.0649]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
[2025-11-06 18:16:58,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.14 | bwd_microstep: 5.78 | bwd_inner_microstep: 5.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-4.7812, -1.1797, 3.4219, -1.1953, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.3750, -1.1953, 0.4902, 3.4375, -0.0515]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0312, -3.6719, 0.1973, 1.7109, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.4922, 0.7148, 3.2969, 1.4766, -1.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.5000, -0.1001, 2.6406, -2.3438, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:17:02,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.10 | optimizer_gradients: 0.20 | optimizer_step: 0.20
[2025-11-06 18:17:02,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.65 | bwd_microstep: 3461.73 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 3460.63 | step_microstep: 3.45
[2025-11-06 18:17:02,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.81 | bwd: 3467.51 | bwd_inner: 6.66 | bwd_allreduce: 3460.69 | step: 3.54
38%|███▊ | 1327/3507 [32:15<1:13:05, 2.01s/it] {'loss': 1.3095, 'learning_rate': 1.4285855659873532e-05, 'epoch': 0.38}

tensor([[-3.7031, -2.3438, 0.7109, 1.3906, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:17:02,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 93.51 | bwd_microstep: 1.33 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.3125, -1.7344, 2.2656, 0.7930, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8125, -4.0938, -0.7344, 2.2344, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.6562, -2.9531, -1.6250, 2.1562, -0.8789]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0625, -0.8516, 2.2188, 0.7148, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3125, -3.0000, 0.5078, 2.2031, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.6094, -1.3438, 2.0781, 0.8633, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5000, -1.2266, 2.2188, 0.8750, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:17:02,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:17:02,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.19 | bwd_microstep: 111.46 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 110.65 | step_microstep: 1.66
[2025-11-06 18:17:02,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.73 | bwd: 112.78 | bwd_inner: 1.92 | bwd_allreduce: 110.70 | step: 1.74
38%|███▊ | 1328/3507 [32:16<55:57, 1.54s/it] {'loss': 0.5642, 'learning_rate': 1.4277507955697716e-05, 'epoch': 0.38}

tensor([[-5.3438, -3.2812, 0.6211, -0.1982, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5000, -1.3750, 2.7188, -0.7305, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1250, -3.8594, -1.1016, 2.7188, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3438, -1.8047, 2.7031, 1.8047, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:17:02,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.84 | bwd_microstep: 1.63 | bwd_inner_microstep: 1.50 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.5469, 0.1001, 3.1250, 0.2773, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.5156, 0.0664, 3.2188, 0.7812, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8125, -3.4062, 0.1279, 1.3750, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.2344, 0.5156, 3.1875, -0.4082, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:17:06,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.43 | optimizer_step: 0.56
[2025-11-06 18:17:06,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.65 | bwd_microstep: 2954.69 | bwd_inner_microstep: 10.43 | bwd_allreduce_microstep: 2944.07 | step_microstep: 4.00
[2025-11-06 18:17:06,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 421.52 | bwd: 2956.33 | bwd_inner: 11.96 | bwd_allreduce: 2944.15 | step: 4.09
38%|███▊ | 1329/3507 [32:19<1:16:29, 2.11s/it] {'loss': 0.5341, 'learning_rate': 1.4269156601652903e-05, 'epoch': 0.38}

tensor([[-7.7188, -6.1875, -1.8047, -0.8203, -5.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4375, -3.5312, -0.3613, 1.9297, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.4062, -5.0625, -1.1562, 0.2441, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:17:06,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.95 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[1.2188, 4.2500, 5.5938, 0.6016, 0.2402]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5312, -4.7188, 0.1191, 1.4375, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9062, -2.6719, 1.5859, 1.2344, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6250, -1.8047, 2.0781, -0.5586, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.0625, -5.5938, -2.1562, 1.4844, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:17:06,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:17:06,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.15 | bwd_microstep: 11.50 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 10.30 | step_microstep: 1.73
[2025-11-06 18:17:06,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 453.08 | bwd: 12.46 | bwd_inner: 1.98 | bwd_allreduce: 10.34 | step: 1.82
38%|███▊ | 1330/3507 [32:20<59:06, 1.63s/it] {'loss': 0.2293, 'learning_rate': 1.4260801604865057e-05, 'epoch': 0.38}

tensor([[-5.2188, -3.1562, 1.2188, 1.3438, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3438, -2.5625, 2.1406, 0.1289, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6719, -3.8906, -2.4688, 1.0078, -1.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:17:06,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.77 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.5312, -3.3906, 0.0115, 1.8281, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4844, -2.5312, 0.3340, 2.3750, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4844, -3.3281, -1.1641, 2.1250, -1.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0000, -2.4531, 2.0938, 0.9688, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.7500, -1.6172, 2.7812, -0.7383, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:17:07,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.17 | optimizer_step: 0.21
[2025-11-06 18:17:07,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.78 | bwd_microstep: 678.09 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 677.03 | step_microstep: 1.95
[2025-11-06 18:17:07,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.58 | bwd: 678.88 | bwd_inner: 1.68 | bwd_allreduce: 677.07 | step: 2.03
38%|███▊ | 1331/3507 [32:21<52:51, 1.46s/it] {'loss': 0.4419, 'learning_rate': 1.425244297246324e-05, 'epoch': 0.38}

tensor([[-5.0938, -2.6562, 0.7812, -1.1016, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:17:07,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.64 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.9062, -1.8047, 2.6719, -0.2158, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.7500, -0.9258, 1.4375, 0.1426, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6406, -1.2422, 1.8906, -0.0659, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5781, -3.4844, -1.4141, 1.8438, -1.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0938, -2.9062, 1.5938, 1.4375, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.9375, -4.5625, -0.4629, 1.1953, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.5156, 1.3438, 3.0469, -1.2500, -1.8359]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
[2025-11-06 18:17:08,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:17:08,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.60 | bwd_microstep: 64.25 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 63.27 | step_microstep: 1.45
[2025-11-06 18:17:08,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.27 | bwd: 65.13 | bwd_inner: 1.71 | bwd_allreduce: 63.30 | step: 1.52
38%|███▊ | 1332/3507 [32:21<41:22, 1.14s/it] {'loss': 0.4636, 'learning_rate': 1.424408071157963e-05, 'epoch': 0.38}

tensor([[-4.7188, -4.3125, -0.9727, 2.9531, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:17:08,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.45 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.3750, -3.5469, -0.0552, 2.8594, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5312, -2.7500, 1.1953, 1.5859, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.5312, -4.2188, 0.8477, 0.6992, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.8750, -3.0312, 1.7344, -0.3926, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0938, -3.9531, -1.3359, 2.5469, -1.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2812, -3.2188, -0.0237, 1.7188, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.7500, -0.3535, 2.2969, 0.2080, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:17:08,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 18:17:08,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.95 | bwd_microstep: 660.31 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 659.20 | step_microstep: 1.80
[2025-11-06 18:17:08,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 280.42 | bwd: 661.32 | bwd_inner: 1.94 | bwd_allreduce: 659.24 | step: 1.88
38%|███▊ | 1333/3507 [32:22<39:30, 1.09s/it] {'loss': 0.2216, 'learning_rate': 1.4235714829349483e-05, 'epoch': 0.38}

tensor([[-0.9688, 1.5938, 2.3594, -1.4453, -1.3828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-2.6719, 0.1504, 2.7969, -0.8984, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.3750, -5.5000, -2.9375, 1.3906, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:17:10,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.32 | bwd_microstep: 1.21 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.7500, -4.1562, -0.8477, 2.5781, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7188, -1.5859, 2.0625, 1.3281, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4375, -1.4688, 1.3438, 0.7383, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.2188, -3.5156, -0.2324, 2.8594, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8125, -0.1279, 2.6094, -0.4336, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:17:11,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:17:11,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.24 | bwd_microstep: 1251.13 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 1250.28 | step_microstep: 2.50
[2025-11-06 18:17:11,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.59 | bwd: 1252.34 | bwd_inner: 1.88 | bwd_allreduce: 1250.32 | step: 2.57
38%|███▊ | 1334/3507 [32:25<58:33, 1.62s/it] {'loss': 0.306, 'learning_rate': 1.422734533291116e-05, 'epoch': 0.38}

tensor([[-4.1562, -2.7188, 1.0547, 2.4219, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4062, -2.4375, 1.0703, 0.7539, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:17:12,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.90 | bwd_microstep: 1.14 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-3.7656, -2.9688, 0.2617, 3.0625, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7188, -3.2969, 0.0649, 0.6914, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4531, -0.7891, 2.3906, -0.0320, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4062, -2.7344, 0.8242, 1.3125, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.7812, -4.0938, -0.4395, 3.1562, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8438, -3.1250, 0.9688, 1.7422, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:17:12,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:17:12,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.27 | bwd_microstep: 122.44 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 121.29 | step_microstep: 1.94
[2025-11-06 18:17:12,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.23 | bwd: 123.57 | bwd_inner: 2.01 | bwd_allreduce: 121.35 | step: 2.05
38%|███▊ | 1335/3507 [32:26<46:51, 1.29s/it] {'loss': 0.5883, 'learning_rate': 1.4218972229406103e-05, 'epoch': 0.38}

tensor([[-2.3906, -0.0752, 1.5625, -1.6875, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:17:12,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 78.88 | bwd_microstep: 1.17 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.1562, -3.1875, 0.2500, 2.7031, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0938, -1.7344, 2.4688, -1.4531, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7500, -1.1875, 2.7031, 1.0547, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.2812, -3.2344, 0.9102, 1.0156, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.8750, -3.9062, 0.9688, 1.5781, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.5312, -3.9062, -2.5938, 1.1406, -1.6328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-6.3438, -5.1250, -0.9258, 1.3672, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:17:14,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:17:14,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.04 | bwd_microstep: 356.17 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 355.12 | step_microstep: 1.72
[2025-11-06 18:17:14,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 253.94 | bwd: 357.35 | bwd_inner: 2.05 | bwd_allreduce: 355.16 | step: 1.80
38%|███▊ | 1336/3507 [32:28<53:25, 1.48s/it] {'loss': 0.6689, 'learning_rate': 1.4210595525978826e-05, 'epoch': 0.38}

tensor([[-4.2188, -3.3281, 0.0796, 2.7188, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.2812, -2.5000, -1.5938, 1.4688, -0.7422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0000, -0.8398, 3.0469, -0.4199, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:17:14,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.25 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.7344, -1.8203, 1.7188, 1.7109, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.3105, 2.1719, 3.5312, -0.2275, -0.7773]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-5.9375, -4.0625, 0.5703, 1.0938, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.7344, -0.7812, 2.5156, -0.8750, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.8438, 0.1289, 3.3906, -0.2676, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:17:14,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:17:14,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.70 | bwd_microstep: 24.33 |
bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 23.21 | step_microstep: 2.60 [2025-11-06 18:17:14,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.98 | bwd: 25.25 | bwd_inner: 1.86 | bwd_allreduce: 23.25 | step: 2.68 38%|███▊ | 1337/3507 [32:28<42:03, 1.16s/it] {'loss': 0.4744, 'learning_rate': 1.4202215229776917e-05, 'epoch': 0.38} 38%|███▊ | 1337/3507 [32:28<42:03, 1.16s/it]tensor([[-3.3594, -0.7344, 2.2656, -0.6953, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4688, -4.0938, -1.4141, 1.7188, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2188, -3.4531, -0.4688, 1.9766, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:17:14,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.05 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-4.9062, -4.5000, -1.5781, 1.7422, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0312, -2.2969, 1.6172, 2.3281, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9531, -1.6016, 1.7422, 0.1406, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5625, -4.0312, -0.2227, 0.8750, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.8125, -3.9531, 0.1875, -5.1562, -7.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:17:16,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:17:16,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.01 | bwd_microstep: 1.83 | bwd_inner_microstep: 1.08 | 
bwd_allreduce_microstep: 0.68 | step_microstep: 1.66 [2025-11-06 18:17:16,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 409.08 | bwd: 2.83 | bwd_inner: 1.95 | bwd_allreduce: 0.73 | step: 1.76 38%|███▊ | 1338/3507 [32:30<49:26, 1.37s/it] {'loss': 0.2484, 'learning_rate': 1.4193831347951034e-05, 'epoch': 0.38} 38%|███▊ | 1338/3507 [32:30<49:26, 1.37s/it]tensor([[-3.8281, -0.4453, 2.9375, -1.4141, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6406, -2.5625, 0.4160, 1.8594, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9688, -4.8438, -0.9766, 1.1406, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -3.0938, 0.6836, 2.5938, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:17:16,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.97 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-1.4609, 0.6562, 2.9375, 1.3516, -1.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4062, -2.6875, 0.8672, 1.0938, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9062, -2.0312, 1.3203, 4.1562, -1.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[0.4160, 2.3594, 4.2500, 2.8594, 0.4668]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:17:19,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.23 | optimizer_step: 0.33 [2025-11-06 18:17:19,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.10 | bwd_microstep: 2612.87 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 2611.85 | step_microstep: 
2.34 [2025-11-06 18:17:19,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.08 | bwd: 2613.64 | bwd_inner: 1.59 | bwd_allreduce: 2611.90 | step: 2.42 38%|███▊ | 1339/3507 [32:33<1:07:18, 1.86s/it] {'loss': 0.4099, 'learning_rate': 1.4185443887654891e-05, 'epoch': 0.38} 38%|███▊ | 1339/3507 [32:33<1:07:18, 1.86s/it]tensor([[-2.8438, -3.1562, -1.8594, 1.7969, -1.0859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:17:19,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.72 | bwd_microstep: 1.29 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.6719, -1.6953, 0.8789, 2.1719, -1.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6562, -3.2344, 0.2129, 1.5781, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5781, -3.3281, -0.8008, 2.8125, -1.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5312, -4.1562, 0.1001, 1.6094, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.5469, -1.7188, 1.5000, 1.2656, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9375, -1.5156, 1.8516, -0.0540, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4062, -3.1719, -0.5352, 3.0000, -1.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:17:20,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:17:20,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.83 | bwd_microstep: 174.08 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 173.04 | step_microstep: 1.75 [2025-11-06 18:17:20,045] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 269.57 | bwd: 175.36 | bwd_inner: 2.15 | bwd_allreduce: 173.08 | step: 1.84 38%|███▊ | 1340/3507 [32:33<52:14, 1.45s/it] {'loss': 0.6044, 'learning_rate': 1.4177052856045256e-05, 'epoch': 0.38} 38%|███▊ | 1340/3507 [32:33<52:14, 1.45s/it]tensor([[-1.6875, 1.4453, 3.2812, -1.9375, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5312, -0.6719, 1.5781, -2.3281, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2656, -3.2812, -0.9336, 3.2031, -1.2578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -3.4062, -0.0184, 3.0625, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:17:20,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 234.72 | bwd_microstep: 1.28 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-1.7266, 1.4375, 3.3594, -1.5547, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.1875, -2.5000, 0.8984, 1.3047, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0312, 0.1934, 3.2344, -1.1484, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1562, -2.0312, 2.2031, -1.3438, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:17:21,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:17:21,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.99 | bwd_microstep: 607.64 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 606.81 | step_microstep: 1.66 [2025-11-06 18:17:21,212] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 515.73 | bwd: 608.91 | bwd_inner: 1.92 | bwd_allreduce: 606.85 | step: 1.74 38%|███▊ | 1341/3507 [32:35<49:10, 1.36s/it] {'loss': 0.4376, 'learning_rate': 1.4168658260281944e-05, 'epoch': 0.38} 38%|███▊ | 1341/3507 [32:35<49:10, 1.36s/it]tensor([[-5.3438, -3.9062, -0.4863, 0.3750, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6250, -4.7188, -2.1094, 2.3438, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2812, -4.9062, -1.5781, 2.3125, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:17:21,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.05 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-1.1953, 1.6562, 3.0156, -1.5391, -1.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0000, -2.6406, 0.9102, 2.2969, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8125, -2.1250, 1.8672, -0.1729, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4688, -1.6562, 2.4531, -0.1533, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3125, -2.4688, 2.2656, 0.1309, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:17:22,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.19 | optimizer_step: 0.23 [2025-11-06 18:17:22,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.26 | bwd_microstep: 622.91 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 622.06 | step_microstep: 2.21 [2025-11-06 18:17:22,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.35 | bwd: 623.80 | 
bwd_inner: 1.53 | bwd_allreduce: 622.12 | step: 2.31 38%|███▊ | 1342/3507 [32:36<45:12, 1.25s/it] {'loss': 0.1592, 'learning_rate': 1.4160260107527812e-05, 'epoch': 0.38} 38%|███▊ | 1342/3507 [32:36<45:12, 1.25s/it]tensor([[4.9688, 7.2188, 8.3125, 5.4062, 3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1250, -3.9531, -0.0228, 2.2188, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:17:22,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.88 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0938, -3.5938, 0.4043, 1.7344, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1250, 1.1875, 4.3438, -0.2656, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1250, -3.0625, 0.8320, 0.4609, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8594, -1.9141, 2.0469, 2.2500, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0000, -2.8125, -0.8398, 1.9766, -1.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2656, 0.6797, 2.4531, -1.8203, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:17:22,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:17:22,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.75 | bwd_microstep: 126.90 | bwd_inner_microstep: 4.34 | bwd_allreduce_microstep: 122.48 | step_microstep: 1.68 [2025-11-06 18:17:22,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.66 | bwd: 127.64 | bwd_inner: 4.97 | bwd_allreduce: 122.52 | 
step: 1.76 38%|███▊ | 1343/3507 [32:36<37:31, 1.04s/it] {'loss': 0.3477, 'learning_rate': 1.4151858404948748e-05, 'epoch': 0.38} 38%|███▊ | 1343/3507 [32:36<37:31, 1.04s/it]tensor([[-3.8438, -2.0625, 0.8828, 0.5156, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:17:22,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.80 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.6250, -2.9062, 1.2578, -0.6445, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.8359, -0.7227, 1.2891, 2.0312, -0.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6875, -3.7344, -0.1641, 2.2031, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1562, -1.8516, 0.1328, 3.0625, -0.6367]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6875, -1.0391, 2.0156, -0.6523, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4375, -4.3125, -0.2500, 2.1719, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2188, -2.7656, 0.9570, 2.3906, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:17:25,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.23 | optimizer_step: 0.20 [2025-11-06 18:17:25,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.51 | bwd_microstep: 2494.33 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 2493.22 | step_microstep: 2.01 [2025-11-06 18:17:25,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.34 | bwd: 2495.19 | bwd_inner: 1.78 | bwd_allreduce: 2493.26 | step: 2.10 38%|███▊ | 1344/3507 
[32:39<57:08, 1.59s/it] {'loss': 0.275, 'learning_rate': 1.4143453159713675e-05, 'epoch': 0.38} 38%|███▊ | 1344/3507 [32:39<57:08, 1.59s/it]tensor([[-3.8906, -3.2188, -0.4473, 1.8047, -2.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.4297, 1.6562, 3.0469, -2.1094, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:17:25,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.71 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.5938, -1.6016, 1.1797, 0.0332, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6250, -1.3047, 1.9609, 0.7422, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.6875, 0.1040, 2.6875, -0.9648, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1250, 0.1670, 3.3281, -1.3203, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.3906, -1.7031, 0.1348, 4.8125, 0.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1562, -1.9062, 0.9805, 2.1406, -1.8359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:17:28,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:17:28,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.25 | bwd_microstep: 2341.17 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 2339.96 | step_microstep: 1.78 [2025-11-06 18:17:28,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 294.99 | bwd: 2341.95 | bwd_inner: 1.83 | bwd_allreduce: 2340.00 | step: 1.85 38%|███▊ | 1345/3507 [32:42<1:08:49, 1.91s/it] {'loss': 0.3604, 
'learning_rate': 1.4135044378994538e-05, 'epoch': 0.38} 38%|███▊ | 1345/3507 [32:42<1:08:49, 1.91s/it]tensor([[-4.4688, -2.0000, 1.6562, 0.2432, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:17:28,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.85 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.5000, -3.8438, 0.4844, 1.4688, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6562, -3.8281, -0.4141, 2.4219, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1250, -1.6719, 1.6719, 2.8125, -1.7422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6250, -1.7266, 2.6094, 0.2520, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4688, -2.2344, 1.5938, 0.6836, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -1.9766, 1.6875, 0.8203, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.5000, 0.4043, 2.8281, -1.0078, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:17:28,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:17:28,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.86 | bwd_microstep: 8.98 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 7.89 | step_microstep: 1.79 [2025-11-06 18:17:28,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.74 | bwd: 9.91 | bwd_inner: 1.86 | bwd_allreduce: 7.93 | step: 1.87 38%|███▊ | 1346/3507 [32:42<52:57, 1.47s/it] {'loss': 0.4929, 'learning_rate': 1.4126632069966292e-05, 'epoch': 
0.38} 38%|███▊ | 1346/3507 [32:42<52:57, 1.47s/it]tensor([[-6.0625, -4.0625, 0.7734, 1.1406, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:17:28,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.42 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-1.5156, 1.5859, 4.3125, -0.0747, -1.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([[-3.4219, -3.1875, -0.7109, 2.6875, -1.5391]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([3], device='cuda:1') tensor([[-2.9531, 0.4277, 3.5469, -1.4844, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5938, -4.8438, -1.4453, 1.2266, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.3105, 2.0469, 4.2188, 3.6719, 0.6133]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.9375, -1.6406, 2.3281, -1.4844, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0938, -1.1562, 2.8906, -0.2285, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:17:31,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.25 | optimizer_step: 0.38 [2025-11-06 18:17:31,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.64 | bwd_microstep: 2180.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 2180.00 | step_microstep: 2.95 [2025-11-06 18:17:31,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.09 | bwd: 2181.72 | bwd_inner: 1.50 | bwd_allreduce: 2180.06 | step: 3.04 38%|███▊ | 1347/3507 [32:45<1:04:51, 1.80s/it] {'loss': 0.2275, 'learning_rate': 1.411821623980691e-05, 'epoch': 0.38} 38%|███▊ | 1347/3507 
[32:45<1:04:51, 1.80s/it]tensor([[-3.9062, -1.1562, 2.1406, -0.7617, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7500, -4.4062, -1.2500, 2.5781, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:17:31,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.22 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.1250, 0.3984, 4.0312, -0.6406, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8438, -3.4688, 0.8672, -0.1016, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9688, -0.1025, 3.8750, -1.8438, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4688, -2.2031, 0.7930, -0.8984, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4062, -0.5781, 2.0156, -1.5078, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6250, -4.3438, -1.0547, 2.9375, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:17:31,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.68 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:17:31,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.79 | bwd_microstep: 217.42 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 216.39 | step_microstep: 2.12 [2025-11-06 18:17:31,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.04 | bwd: 218.39 | bwd_inner: 1.81 | bwd_allreduce: 216.43 | step: 2.21 38%|███▊ | 1348/3507 [32:45<51:23, 1.43s/it] {'loss': 0.2305, 'learning_rate': 1.4109796895697368e-05, 'epoch': 0.38} 38%|███▊ | 1348/3507 [32:45<51:23, 
1.43s/it]tensor([[-2.6562, -2.7031, -0.6016, 3.5469, -0.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5625, -4.1562, -1.1328, 2.2500, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-4.1562, -2.4688, 0.4648, 0.2832, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([2], device='cuda:1') [2025-11-06 18:17:32,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.00 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.12 tensor([[-6.0000, -3.8125, 1.1719, 1.1094, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7969, -2.0625, 0.2734, 2.2500, -1.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6562, -3.6406, 0.0618, 2.4062, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0938, -3.8906, -0.8867, 3.2031, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2500, -0.8086, 3.3125, -1.1719, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:17:33,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:17:33,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.94 | bwd_microstep: 1096.48 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1095.42 | step_microstep: 1.54 [2025-11-06 18:17:33,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.97 | bwd: 1097.27 | bwd_inner: 1.64 | bwd_allreduce: 1095.47 | step: 1.66 38%|███▊ | 1349/3507 [32:47<52:17, 1.45s/it] {'loss': 0.2232, 'learning_rate': 1.4101374044821639e-05, 'epoch': 0.38} 38%|███▊ | 1349/3507 [32:47<52:17, 1.45s/it]tensor([[-5.5938, -2.6250, 
2.5156, 0.4668, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:17:33,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.13 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.5938, -3.8750, -2.1875, 1.7109, -1.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2812, 0.5625, 2.6406, -1.4766, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, -1.7266, 2.6875, -1.2109, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8750, -2.8438, 0.8789, 0.3398, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3125, -1.9609, 1.5859, 0.0620, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, -1.9141, 2.5156, -1.3047, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.0781, -0.6328, 2.2969, -0.2412, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:17:33,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:17:33,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.17 | bwd_microstep: 4.30 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 3.25 | step_microstep: 1.70 [2025-11-06 18:17:33,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 420.33 | bwd: 5.26 | bwd_inner: 1.83 | bwd_allreduce: 3.29 | step: 1.79 38%|███▊ | 1350/3507 [32:47<41:36, 1.16s/it] {'loss': 0.1565, 'learning_rate': 1.4092947694366687e-05, 'epoch': 0.38} 38%|███▊ | 1350/3507 [32:47<41:36, 1.16s/it]tensor([[-4.2812, -4.4375, -2.1406, 1.7734, -2.1562]], device='cuda:0', 
[... per-rank debug prints omitted: each microstep also printed a 1x5 bf16 logit tensor and an integer label tensor from cuda:0-3; their grad_fn names were lost in extraction ...]
[2025-11-06 18:17:36,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 345.16 | bwd: 2544.31 | bwd_inner: 1.91 | bwd_allreduce: 2542.25 | step: 1.79
 39%|███▊ | 1351/3507 [32:50<1:00:40, 1.69s/it] {'loss': 0.4107, 'learning_rate': 1.4084517851522466e-05, 'epoch': 0.39}
[2025-11-06 18:17:37,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.79 | bwd: 3.49 | bwd_inner: 2.67 | bwd_allreduce: 0.70 | step: 1.93
 39%|███▊ | 1352/3507 [32:50<46:49, 1.30s/it] {'loss': 1.1107, 'learning_rate': 1.4076084523481905e-05, 'epoch': 0.39}
[2025-11-06 18:17:39,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.34 | bwd: 1949.08 | bwd_inner: 2.09 | bwd_allreduce: 1946.85 | step: 2.36
 39%|███▊ | 1353/3507 [32:53<57:45, 1.61s/it] {'loss': 0.2885, 'learning_rate': 1.4067647717440909e-05, 'epoch': 0.39}
[2025-11-06 18:17:39,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 439.92 | bwd: 23.40 | bwd_inner: 1.64 | bwd_allreduce: 21.63 | step: 1.91
 39%|███▊ | 1354/3507 [32:53<45:50, 1.28s/it] {'loss': 1.0043, 'learning_rate': 1.4059207440598357e-05, 'epoch': 0.39}
[2025-11-06 18:17:41,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.06 | bwd: 1456.83 | bwd_inner: 8.70 | bwd_allreduce: 1447.97 | step: 2.33
 39%|███▊ | 1355/3507 [32:55<52:22, 1.46s/it] {'loss': 0.5383, 'learning_rate': 1.4050763700156074e-05, 'epoch': 0.39}
[2025-11-06 18:17:42,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 417.59 | bwd: 2.85 | bwd_inner: 1.80 | bwd_allreduce: 0.90 | step: 1.82
 39%|███▊ | 1356/3507 [32:56<41:35, 1.16s/it] {'loss': 0.4583, 'learning_rate': 1.4042316503318858e-05, 'epoch': 0.39}
[2025-11-06 18:17:45,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.22 | bwd: 2573.92 | bwd_inner: 1.94 | bwd_allreduce: 2571.84 | step: 2.41
 39%|███▊ | 1357/3507 [32:59<1:00:12, 1.68s/it] {'loss': 0.2462, 'learning_rate': 1.4033865857294447e-05, 'epoch': 0.39}
[2025-11-06 18:17:45,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 380.87 | bwd: 95.46 | bwd_inner: 1.60 | bwd_allreduce: 93.71 | step: 1.89
 39%|███▊ | 1358/3507 [32:59<47:39, 1.33s/it] {'loss': 0.4184, 'learning_rate': 1.402541176929352e-05, 'epoch': 0.39}
[2025-11-06 18:17:48,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.39 | bwd: 2427.02 | bwd_inner: 1.51 | bwd_allreduce: 2425.38 | step: 2.06
 39%|███▉ | 1359/3507 [33:02<1:03:16, 1.77s/it] {'loss': 0.2311, 'learning_rate': 1.4016954246529697e-05, 'epoch': 0.39}
[2025-11-06 18:17:48,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 264.82 | bwd: 156.93 | bwd_inner: 2.66 | bwd_allreduce: 154.13 | step: 2.41
 39%|███▉ | 1360/3507 [33:02<49:07, 1.37s/it] {'loss': 0.5318, 'learning_rate': 1.400849329621953e-05, 'epoch': 0.39}
[2025-11-06 18:17:52,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.97 | bwd: 2775.85 | bwd_inner: 6.75 | bwd_allreduce: 2768.90 | step: 3.38
 39%|███▉ | 1361/3507 [33:05<1:07:53, 1.90s/it] {'loss': 1.028, 'learning_rate': 1.400002892558249e-05, 'epoch': 0.39}
[2025-11-06 18:17:52,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 397.87 | bwd: 2.63 | bwd_inner: 1.65 | bwd_allreduce: 0.82 | step: 3.12
 39%|███▉ | 1362/3507 [33:06<52:23, 1.47s/it] {'loss': 0.5096, 'learning_rate': 1.3991561141840976e-05, 'epoch': 0.39}
[2025-11-06 18:17:55,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.40 | bwd: 2060.50 | bwd_inner: 1.75 | bwd_allreduce: 2058.59 | step: 1.89
 39%|███▉ | 1363/3507 [33:08<1:02:57, 1.76s/it] {'loss': 0.3528, 'learning_rate': 1.3983089952220289e-05, 'epoch': 0.39}
[2025-11-06 18:17:55,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 439.31 | bwd: 2.86 | bwd_inner: 1.89 | bwd_allreduce: 0.86 | step: 1.57
 39%|███▉ | 1364/3507 [33:09<49:13, 1.38s/it] {'loss': 0.4019, 'learning_rate': 1.397461536394864e-05, 'epoch': 0.39}
[2025-11-06 18:17:57,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.93 | bwd: 1603.83 | bwd_inner: 2.58 | bwd_allreduce: 1601.06 | step: 2.05
 39%|███▉ | 1365/3507 [33:11<55:04, 1.54s/it] {'loss': 0.3993, 'learning_rate': 1.3966137384257145e-05, 'epoch': 0.39}
[2025-11-06 18:17:57,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.35 | bwd: 23.63 | bwd_inner: 1.90 | bwd_allreduce: 21.56 | step: 1.83
 39%|███▉ | 1366/3507 [33:11<43:31, 1.22s/it] {'loss': 0.4237, 'learning_rate': 1.3957656020379806e-05, 'epoch': 0.39}
[2025-11-06 18:17:59,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 480.10 | bwd: 1345.39 | bwd_inner: 1.89 | bwd_allreduce: 1343.37 | step: 2.13
 39%|███▉ | 1367/3507 [33:13<50:25, 1.41s/it] {'loss': 0.5573, 'learning_rate': 1.3949171279553515e-05, 'epoch': 0.39}
[2025-11-06 18:18:00,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.54 | bwd: 456.18 | bwd_inner: 1.76 | bwd_allreduce: 454.28 | step: 2.81
 39%|███▉ | 1368/3507 [33:14<44:55, 1.26s/it] {'loss': 0.3234, 'learning_rate': 1.394068316901805e-05, 'epoch': 0.39}
[2025-11-06 18:18:03,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 310.42 | bwd: 2332.19 | bwd_inner: 1.35 | bwd_allreduce: 2330.67 | step: 2.74
 39%|███▉ | 1369/3507 [33:17<1:05:01, 1.83s/it] {'loss': 0.5021, 'learning_rate': 1.3932191696016055e-05, 'epoch': 0.39}
[2025-11-06 18:18:04,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.22 | bwd: 5.60 | bwd_inner: 1.85 | bwd_allreduce: 3.54 | step: 3.11
 39%|███▉ | 1370/3507 [33:18<50:08, 1.41s/it] {'loss': 0.1323, 'learning_rate': 1.3923696867793055e-05, 'epoch': 0.39}
[2025-11-06 18:18:07,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 415.74 | bwd: 1395.79 | bwd_inner: 3.97 | bwd_allreduce: 1391.65 | step: 2.31
 39%|███▉ | 1371/3507 [33:21<1:07:34, 1.90s/it] {'loss': 0.5261, 'learning_rate': 1.3915198691597427e-05, 'epoch': 0.39}
-4.2500, -0.5195, 2.1094, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:18:08,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.70 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.09 tensor([[-5.3438, -2.6094, 1.1641, -1.2656, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:18:08,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:18:08,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.75 | bwd_microstep: 1.72 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.78 | step_microstep: 2.66 [2025-11-06 18:18:08,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 390.49 | bwd: 2.81 | bwd_inner: 1.79 | bwd_allreduce: 0.84 | step: 2.75 39%|███▉ | 1372/3507 [33:22<58:05, 1.63s/it] {'loss': 0.8282, 'learning_rate': 1.390669717468041e-05, 'epoch': 0.39} 39%|███▉ | 1372/3507 [33:22<58:05, 1.63s/it]tensor([[-2.0000, -0.6562, 1.8359, 2.5469, -1.0078]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3750, -3.0938, -0.5391, 2.8125, -1.5234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:18:08,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.71 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.6562, -3.1875, 0.3750, 1.3594, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8438, 0.4004, 2.7812, -1.8750, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.0312, -0.8359, 2.6562, -1.2734, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-4.8438, -5.2188, -3.0000, 1.5703, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-5.7812, -3.3438, 1.6328, 0.7969, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7500, -4.0312, -2.3125, 1.5078, -1.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') [2025-11-06 18:18:08,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:18:08,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.32 | bwd_microstep: 27.85 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 26.73 | step_microstep: 1.47 [2025-11-06 18:18:08,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 417.06 | bwd: 28.76 | bwd_inner: 1.86 | bwd_allreduce: 26.77 | step: 1.55 39%|███▉ | 1373/3507 [33:22<45:48, 1.29s/it] {'loss': 1.4843, 'learning_rate': 1.3898192324296096e-05, 'epoch': 0.39} 39%|███▉ | 1373/3507 [33:22<45:48, 1.29s/it]tensor([[-4.1562, -1.3438, 2.4688, 0.1816, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9688, -3.4688, 0.0562, -1.8672, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8750, -3.8438, -1.2578, 2.8438, -1.7578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0625, -6.2812, -3.6406, 0.7930, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -1.4688, 2.2812, 0.3359, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9062, -4.2188, 0.4277, 1.5625, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1875, -2.2500, 2.2188, -0.2139, -4.1562]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:18:10,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.91 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.0000, -2.0156, 2.4375, -0.3340, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:18:10,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 18:18:10,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.17 | bwd_microstep: 1.92 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.32 [2025-11-06 18:18:10,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.09 | bwd: 2.80 | bwd_inner: 1.79 | bwd_allreduce: 0.87 | step: 2.40 39%|███▉ | 1374/3507 [33:24<48:30, 1.36s/it] {'loss': 0.548, 'learning_rate': 1.3889684147701417e-05, 'epoch': 0.39} 39%|███▉ | 1374/3507 [33:24<48:30, 1.36s/it]tensor([[-5.5312e+00, -3.7344e+00, -4.0588e-03, 2.5000e-01, -3.9375e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9375, -3.1875, -0.2891, 1.9766, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:18:10,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.44 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.0000, -4.5938, -1.2188, 2.4688, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4375, -2.6875, 0.1040, 2.5625, -1.7422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4375, -3.2656, -0.8984, 2.4844, -1.5703]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9297, 1.3203, 2.9219, -2.3750, 
-2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.9062, -2.8594, 0.4668, 2.5000, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7031, -2.1875, 1.1797, 1.7969, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:18:10,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:18:10,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.33 | bwd_microstep: 88.54 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 87.60 | step_microstep: 1.62 [2025-11-06 18:18:10,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.79 | bwd: 89.41 | bwd_inner: 1.65 | bwd_allreduce: 87.64 | step: 1.70 39%|███▉ | 1375/3507 [33:24<39:07, 1.10s/it] {'loss': 0.4883, 'learning_rate': 1.388117265215614e-05, 'epoch': 0.39} 39%|███▉ | 1375/3507 [33:24<39:07, 1.10s/it]tensor([[-5.5938, -3.6875, 0.3086, 0.2441, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8906, -4.1250, -1.8047, 2.5625, -1.6953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.1094, 0.0278, 2.8125, 1.9219, -1.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.5156, -2.9062, -1.7812, 1.8516, -0.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6406, -3.5781, -1.1562, 2.4219, -1.7109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.3281, -2.4219, -0.1484, 3.9375, -0.4805]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.9297, 0.5117, 3.5312, 1.4922, -1.5391]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:18:13,341] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.40 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.7500, -3.2031, 0.5039, 1.4609, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:18:13,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.29 | optimizer_gradients: 0.15 | optimizer_step: 0.19 [2025-11-06 18:18:13,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.27 | bwd_microstep: 1.98 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.91 | step_microstep: 3.59 [2025-11-06 18:18:13,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.68 | bwd: 3.07 | bwd_inner: 1.98 | bwd_allreduce: 0.95 | step: 3.69 39%|███▉ | 1376/3507 [33:27<56:41, 1.60s/it] {'loss': 0.4903, 'learning_rate': 1.3872657844922879e-05, 'epoch': 0.39} 39%|███▉ | 1376/3507 [33:27<56:41, 1.60s/it]tensor([[-5.5312, -2.9844, 1.6328, 0.3652, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6562, -3.0000, 0.8242, 1.5469, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5625, 1.0156, 3.3125, -2.3438, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:18:13,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.53 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0938, -4.0000, -0.2490, 1.9219, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7656, -1.3125, 1.7422, -0.3945, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1719, 0.5000, 3.3438, -2.3594, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') 
tensor([[-4.0625, -4.0312, -1.2969, 2.7344, -1.9141]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2188, -0.9102, 2.9844, -0.8008, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:18:14,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:18:14,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.98 | bwd_microstep: 130.80 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 129.93 | step_microstep: 1.81 [2025-11-06 18:18:14,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.54 | bwd: 131.81 | bwd_inner: 1.67 | bwd_allreduce: 129.97 | step: 1.89 39%|███▉ | 1377/3507 [33:27<45:00, 1.27s/it] {'loss': 0.1483, 'learning_rate': 1.3864139733267047e-05, 'epoch': 0.39} 39%|███▉ | 1377/3507 [33:27<45:00, 1.27s/it]tensor([[-3.8125, -0.6328, 2.5625, -1.6094, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2500, -3.0000, 1.5156, 0.9414, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1250, -4.0625, -0.0239, -0.5195, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0625, -1.4297, 2.1094, -0.1128, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9375, -3.8125, -0.2188, 1.7500, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4375, -4.1562, -0.4707, 1.0156, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -2.6562, 1.3281, 0.9453, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:18:15,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.22 | 
bwd_microstep: 1.10 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.22 tensor([[-6.0625, -4.7812, -0.6875, 1.0781, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:18:15,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.24 | optimizer_step: 0.28 [2025-11-06 18:18:15,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.38 | bwd_microstep: 2.05 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.99 | step_microstep: 2.39 [2025-11-06 18:18:15,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.61 | bwd: 3.15 | bwd_inner: 1.95 | bwd_allreduce: 1.03 | step: 2.62 39%|███▉ | 1378/3507 [33:29<50:05, 1.41s/it] {'loss': 0.386, 'learning_rate': 1.3855618324456912e-05, 'epoch': 0.39} 39%|███▉ | 1378/3507 [33:29<50:05, 1.41s/it]tensor([[-5.0938, -2.9219, 1.6250, 1.5781, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:18:15,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 140.74 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.1562, -3.8281, -0.8750, 2.5938, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5625, -5.1562, -0.8555, 0.4082, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7188, -3.9219, -1.5703, 2.8750, -1.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-4.8125, -0.9062, 3.4844, -1.8906, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5625, -4.0312, -1.0547, 1.7422, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.0938, -2.8125, -0.7578, 2.1875, -1.4219]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1562, -3.5000, -0.5195, 2.2656, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:18:16,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:18:16,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.32 | bwd_microstep: 208.73 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 207.41 | step_microstep: 1.76 [2025-11-06 18:18:16,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.08 | bwd: 209.52 | bwd_inner: 1.88 | bwd_allreduce: 207.46 | step: 1.85 39%|███▉ | 1379/3507 [33:30<41:12, 1.16s/it] {'loss': 0.8692, 'learning_rate': 1.3847093625763517e-05, 'epoch': 0.39} 39%|███▉ | 1379/3507 [33:30<41:12, 1.16s/it]tensor([[-1.8281, -0.9766, 1.8203, 4.0938, -0.4551]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3125, -2.6719, 1.2891, 2.4531, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5000, -3.2969, 0.1738, 1.8594, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5938, -2.2344, 2.0000, -1.7578, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6250, -4.5312, -0.7109, 1.3750, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1562, -3.4375, -1.7812, 2.0312, -1.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.9531, 1.1719, 1.9609, -0.2090, -0.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:18:18,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.34 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.04 | 
step_microstep: 0.09 tensor([[-3.2188, -0.8672, 2.1562, 0.1387, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:18:23,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.38 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:18:23,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.81 | bwd_microstep: 1.74 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.77 | step_microstep: 3.96 [2025-11-06 18:18:23,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 312.17 | bwd: 2.56 | bwd_inner: 1.58 | bwd_allreduce: 0.82 | step: 4.05 39%|███▉ | 1380/3507 [33:36<1:39:53, 2.82s/it] {'loss': 0.5488, 'learning_rate': 1.3838565644460745e-05, 'epoch': 0.39} 39%|███▉ | 1380/3507 [33:36<1:39:53, 2.82s/it]tensor([[-4.4688, -2.6250, 1.2656, 1.1328, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.3125, -6.8125, -2.8125, -1.7188, -6.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0703, -1.1562, 0.8555, 4.7812, 0.4629]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4375, -2.9375, 0.3145, 0.4609, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3125, -0.2031, 2.4062, -1.5547, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4531, -2.5781, 0.0552, 2.2500, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -3.3125, 0.2158, 2.8906, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:18:23,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 438.15 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.1094, -1.9062, 
1.2031, 2.5781, -1.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:18:23,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.13 | optimizer_step: 0.14 [2025-11-06 18:18:23,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.80 | bwd_microstep: 1.75 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.80 | step_microstep: 1.62 [2025-11-06 18:18:23,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 618.98 | bwd: 2.58 | bwd_inner: 1.62 | bwd_allreduce: 0.84 | step: 1.71 39%|███▉ | 1381/3507 [33:37<1:17:06, 2.18s/it] {'loss': 0.2931, 'learning_rate': 1.383003438782526e-05, 'epoch': 0.39} 39%|███▉ | 1381/3507 [33:37<1:17:06, 2.18s/it]tensor([[-3.4531, -1.9297, 1.3672, 2.0781, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5312, -4.7188, -0.6406, -0.3848, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -3.8594, -1.3438, 2.0000, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:18:23,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.37 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 tensor([[-4.6250, -1.8125, 2.2812, 0.0742, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4688, -1.2109, 3.0625, -0.5117, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9375, -1.7344, 2.6875, -0.6680, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8750, -2.0625, 2.8906, -1.6562, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4375, -1.6562, 2.7812, -1.9766, -4.9375]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:18:25,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:18:25,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.52 | bwd_microstep: 2.46 | bwd_inner_microstep: 1.58 | bwd_allreduce_microstep: 0.81 | step_microstep: 1.79 [2025-11-06 18:18:25,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 462.91 | bwd: 3.47 | bwd_inner: 2.43 | bwd_allreduce: 0.87 | step: 1.90 39%|███▉ | 1382/3507 [33:39<1:12:56, 2.06s/it] {'loss': 0.1651, 'learning_rate': 1.382149986313653e-05, 'epoch': 0.39} 39%|███▉ | 1382/3507 [33:39<1:12:56, 2.06s/it]tensor([[-4.5000, -2.2188, 1.3125, 0.1484, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:18:25,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.71 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.6562, -3.3281, 1.1016, 0.3867, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9219, 0.5547, 3.6719, -1.3750, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.9062, -4.5625, 0.3125, -0.1865, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9688, -3.0938, 1.1719, 1.6484, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.9062, -1.4531, 2.2031, -2.2344, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4531, -3.5312, -1.3594, 2.2500, -1.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3438, -5.0000, -1.9766, 0.9531, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:1') [2025-11-06 18:18:26,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:18:26,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.30 | bwd_microstep: 126.36 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 125.23 | step_microstep: 1.95 [2025-11-06 18:18:26,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.03 | bwd: 127.22 | bwd_inner: 1.82 | bwd_allreduce: 125.28 | step: 2.03 39%|███▉ | 1383/3507 [33:39<56:05, 1.58s/it] {'loss': 0.756, 'learning_rate': 1.3812962077676801e-05, 'epoch': 0.39} 39%|███▉ | 1383/3507 [33:39<56:05, 1.58s/it]tensor([[-5.5000, -3.9531, 0.4102, 1.7109, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)tensor([[-5.4688, -4.0938, 0.3906, 2.3750, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')tensor([3], device='cuda:1') tensor([[-2.8125, -0.8281, 1.8516, 1.1172, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:18:26,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.32 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.8125, -4.1562, -0.5586, 2.6094, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, -1.0156, 2.8281, -1.4531, -3.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.3613, -0.3516, 1.3359, 4.9375, 0.9727]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0000, -3.0156, 0.6562, 2.9062, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1250, -4.9062, -0.4375, 2.0938, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:18:27,473] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.07 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:18:27,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.91 | bwd_microstep: 1.84 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.84 | step_microstep: 3.02 [2025-11-06 18:18:27,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.26 | bwd: 2.56 | bwd_inner: 1.55 | bwd_allreduce: 0.87 | step: 3.11 39%|███▉ | 1384/3507 [33:41<54:45, 1.55s/it] {'loss': 0.1457, 'learning_rate': 1.3804421038731122e-05, 'epoch': 0.39} 39%|███▉ | 1384/3507 [33:41<54:45, 1.55s/it]tensor([[-3.2500e+00, -2.1057e-03, 3.3594e+00, -5.8203e-01, -3.0625e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.6875, -2.2812, 0.7852, 4.3750, -0.8477]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1562, -3.2188, 0.2539, 2.5625, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -1.1250, 1.9688, -0.9844, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:18:27,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.89 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.6562, -1.5078, 1.2422, -2.9219, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -1.7969, 1.6641, -1.9766, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5312, -1.3203, 2.8438, -0.5312, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-5.1250, -3.9688, -0.1221, 1.8906, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:18:27,902] [INFO] 
[Per-rank debug prints of bfloat16 logit tensors and integer label tensors (cuda:0–cuda:3), interleaved between the log entries below, have been elided; their truncated grad_fn=<…> values were unrecoverable in the capture. Duplicate tqdm progress-bar redraws are collapsed to one line per step.]

[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.21 | optimizer_step: 0.21
[2025-11-06 18:18:27,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.04 | bwd_microstep: 20.43 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 19.41 | step_microstep: 1.96
[2025-11-06 18:18:27,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.97 | bwd: 21.34 | bwd_inner: 1.72 | bwd_allreduce: 19.45 | step: 2.05
39%|███▉ | 1385/3507 [33:41<42:53, 1.21s/it] {'loss': 0.5986, 'learning_rate': 1.3795876753587292e-05, 'epoch': 0.39}

[2025-11-06 18:18:28,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.39 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 18:18:30,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.22 | optimizer_step: 0.20
[2025-11-06 18:18:30,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.94 | bwd_microstep: 1.80 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.94
[2025-11-06 18:18:30,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.33 | bwd: 2.45 | bwd_inner: 1.42 | bwd_allreduce: 0.88 | step: 3.02
40%|███▉ | 1386/3507 [33:44<58:37, 1.66s/it] {'loss': 0.285, 'learning_rate': 1.37873292295359e-05, 'epoch': 0.4}

[2025-11-06 18:18:30,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.63 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 18:18:31,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:18:31,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.86 | bwd_microstep: 25.25 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 24.10 | step_microstep: 1.50
[2025-11-06 18:18:31,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.52 | bwd: 26.09 | bwd_inner: 1.81 | bwd_allreduce: 24.14 | step: 1.58
40%|███▉ | 1387/3507 [33:44<45:21, 1.28s/it] {'loss': 0.7218, 'learning_rate': 1.377877847387029e-05, 'epoch': 0.4}

[2025-11-06 18:18:31,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 234.79 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
[2025-11-06 18:18:32,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.21 | optimizer_step: 0.21
[2025-11-06 18:18:32,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.54 | bwd_microstep: 2.31 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1.02 | step_microstep: 2.52
[2025-11-06 18:18:32,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 413.35 | bwd: 3.10 | bwd_inner: 1.91 | bwd_allreduce: 1.05 | step: 2.61
40%|███▉ | 1388/3507 [33:46<45:23, 1.29s/it] {'loss': 0.1031, 'learning_rate': 1.3770224493886565e-05, 'epoch': 0.4}

[2025-11-06 18:18:32,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.60 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
[2025-11-06 18:18:34,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.19 | optimizer_step: 0.29
[2025-11-06 18:18:34,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.31 | bwd_microstep: 1819.36 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1818.29 | step_microstep: 2.12
[2025-11-06 18:18:34,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.92 | bwd: 1820.22 | bwd_inner: 1.72 | bwd_allreduce: 1818.35 | step: 2.20
40%|███▉ | 1389/3507 [33:48<55:26, 1.57s/it] {'loss': 0.16, 'learning_rate': 1.3761667296883576e-05, 'epoch': 0.4}

[2025-11-06 18:18:34,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.92 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[2025-11-06 18:18:35,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:18:35,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.17 | bwd_microstep: 2.06 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.91 | step_microstep: 2.19
[2025-11-06 18:18:35,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.11 | bwd: 3.00 | bwd_inner: 1.89 | bwd_allreduce: 0.95 | step: 2.27
40%|███▉ | 1390/3507 [33:49<50:19, 1.43s/it] {'loss': 0.437, 'learning_rate': 1.3753106890162927e-05, 'epoch': 0.4}

[2025-11-06 18:18:35,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 290.35 | bwd_microstep: 5.50 | bwd_inner_microstep: 5.35 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
[2025-11-06 18:18:37,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.16 | optimizer_step: 0.20
[2025-11-06 18:18:37,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.57 | bwd_microstep: 1244.72 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1243.61 | step_microstep: 2.04
[2025-11-06 18:18:37,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 471.94 | bwd: 1250.22 | bwd_inner: 6.39 | bwd_allreduce: 1243.67 | step: 2.14
40%|███▉ | 1391/3507 [33:51<53:53, 1.53s/it] {'loss': 0.2741, 'learning_rate': 1.374454328102895e-05, 'epoch': 0.4}

[2025-11-06 18:18:37,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.45 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 18:18:38,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:18:38,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.43 | bwd_microstep: 2.50 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 1.22 | step_microstep: 1.56
[2025-11-06 18:18:38,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.90 | bwd: 3.34 | bwd_inner: 1.93 | bwd_allreduce: 1.24 | step: 1.63
40%|███▉ | 1392/3507 [33:52<47:28, 1.35s/it] {'loss': 0.3548, 'learning_rate': 1.3735976476788722e-05, 'epoch': 0.4}

[2025-11-06 18:18:38,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.05 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 18:18:38,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 18:18:38,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.53 | bwd_microstep: 88.33 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 87.11 | step_microstep: 1.52
[2025-11-06 18:18:38,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.60 | bwd: 89.35 | bwd_inner: 2.08 | bwd_allreduce: 87.15 | step: 1.59
40%|███▉ | 1393/3507 [33:52<37:55, 1.08s/it] {'loss': 0.2629, 'learning_rate': 1.3727406484752033e-05, 'epoch': 0.4}

[2025-11-06 18:18:38,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.85 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
[2025-11-06 18:18:40,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.21 | optimizer_step: 0.18
[2025-11-06 18:18:40,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.09 | bwd_microstep: 1682.51 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1681.40 | step_microstep: 2.28
[2025-11-06 18:18:40,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.96 | bwd: 1683.44 | bwd_inner: 1.83 | bwd_allreduce: 1681.44 | step: 2.37
40%|███▉ | 1394/3507 [33:54<48:15, 1.37s/it] {'loss': 0.4332, 'learning_rate': 1.3718833312231405e-05, 'epoch': 0.4}

[2025-11-06 18:18:40,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.84 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 18:18:41,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.11 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:18:41,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.05 | bwd_microstep: 491.65 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 490.67 | step_microstep: 3.26
[2025-11-06 18:18:41,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.92 | bwd: 492.68 | bwd_inner: 1.83 | bwd_allreduce: 490.71 | step: 3.33
40%|███▉ | 1395/3507 [33:55<43:54, 1.25s/it] {'loss': 0.2716, 'learning_rate': 1.3710256966542065e-05, 'epoch': 0.4}

[2025-11-06 18:18:42,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.11 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 18:18:43,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:18:43,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.14 | bwd_microstep: 1153.36 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1152.16 | step_microstep: 2.24
[2025-11-06 18:18:43,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 478.28 | bwd: 1154.29 | bwd_inner: 1.93 | bwd_allreduce: 1152.20 | step: 2.32
40%|███▉ | 1396/3507 [33:57<48:26, 1.38s/it] {'loss': 0.3951, 'learning_rate': 1.3701677455001954e-05, 'epoch': 0.4}

[2025-11-06 18:18:43,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.91 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 18:18:44,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.29 | optimizer_step: 0.39
[2025-11-06 18:18:44,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 114.28 | bwd_microstep: 1139.69 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1138.59 | step_microstep: 2.99
[2025-11-06 18:18:44,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 268.20 | bwd: 1140.51 | bwd_inner: 1.69 | bwd_allreduce: 1138.66 | step: 3.08
40%|███▉ | 1397/3507 [33:58<49:08, 1.40s/it] {'loss': 0.2362, 'learning_rate': 1.3693094784931708e-05, 'epoch': 0.4}

[2025-11-06 18:18:45,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.61 | bwd_microstep: 1.18 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[2025-11-06 18:18:46,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:18:46,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.20 | bwd_microstep: 449.30 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 448.32 | step_microstep: 2.07
[2025-11-06 18:18:46,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.84 | bwd: 450.47 | bwd_inner: 1.93 | bwd_allreduce: 448.37 | step: 2.16
40%|███▉ | 1398/3507 [33:59<47:39, 1.36s/it] {'loss': 0.5, 'learning_rate': 1.3684508963654667e-05, 'epoch': 0.4}

[2025-11-06 18:18:46,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.77 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 18:18:48,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.21
[2025-11-06 18:18:48,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.33 | bwd_microstep: 1599.05 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 1598.02 | step_microstep: 2.01
[2025-11-06 18:18:48,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 611.13 | bwd: 1600.04 | bwd_inner: 1.82 | bwd_allreduce: 1598.06 | step: 2.08
40%|███▉ | 1399/3507 [34:02<57:09, 1.63s/it] {'loss': 0.4882, 'learning_rate': 1.3675919998496846e-05, 'epoch': 0.4}

[2025-11-06 18:18:48,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.74 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 18:18:48,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:18:48,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.10 | bwd_microstep: 40.28 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 39.17 | step_microstep: 1.73
[2025-11-06 18:18:48,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.86 | bwd: 41.09 | bwd_inner: 1.72 | bwd_allreduce: 39.21 | step: 1.81
40%|███▉ | 1400/3507 [34:02<44:01, 1.25s/it] {'loss': 0.2829, 'learning_rate': 1.3667327896786959e-05, 'epoch': 0.4}

[2025-11-06 18:18:48,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.18 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[2025-11-06 18:18:51,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:18:51,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.08 | bwd_microstep: 2161.73 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 2160.74 | step_microstep: 2.14
[2025-11-06 18:18:51,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 397.28 | bwd: 2162.73 | bwd_inner: 1.78 | bwd_allreduce: 2160.78 | step: 2.23
40%|███▉ | 1401/3507 [34:05<58:11, 1.66s/it] {'loss': 0.0615, 'learning_rate': 1.3658732665856382e-05, 'epoch': 0.4}

[2025-11-06 18:18:51,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.87 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06
[2025-11-06 18:18:51,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:18:51,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.73 | bwd_microstep: 140.34 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 139.06 | step_microstep: 1.98
[2025-11-06 18:18:51,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.62 | bwd: 141.15 | bwd_inner: 1.93 | bwd_allreduce: 139.10 | step: 2.04
40%|███▉ | 1402/3507 [34:05<45:40, 1.30s/it] {'loss': 0.2346, 'learning_rate': 1.3650134313039169e-05, 'epoch': 0.4}

[2025-11-06 18:18:52,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.55 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 18:18:54,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:18:54,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.18 | bwd_microstep: 2330.25 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 2329.36 | step_microstep: 1.73
[2025-11-06 18:18:54,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 280.75 | bwd: 2331.11 | bwd_inner: 1.55 | bwd_allreduce: 2329.41 | step: 1.81
40%|████ | 1403/3507 [34:08<59:47, 1.71s/it] {'loss': 0.9363, 'learning_rate': 1.364153284567204e-05, 'epoch': 0.4}

[2025-11-06 18:18:54,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.15 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 18:18:56,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.19 | optimizer_step: 0.28
[2025-11-06 18:18:56,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.40 | bwd_microstep: 2099.76 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 2098.58 | step_microstep: 2.06
[2025-11-06 18:18:56,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.57 | bwd: 2100.69 | bwd_inner: 1.93 | bwd_allreduce: 2098.63 | step: 2.14
40%|████ | 1404/3507 [34:10<1:07:49, 1.93s/it] {'loss': 0.1609, 'learning_rate': 1.3632928271094366e-05, 'epoch': 0.4}

[2025-11-06 18:18:57,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.02 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 18:18:57,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:18:57,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.69 | bwd_microstep: 2.15 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 0.83 | step_microstep: 1.44
[2025-11-06 18:18:57,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.74 | bwd: 3.07 | bwd_inner: 2.07 | bwd_allreduce: 0.86 | step: 1.51
40%|████ | 1405/3507 [34:11<51:41, 1.48s/it] {'loss': 0.2975, 'learning_rate': 1.3624320596648166e-05, 'epoch': 0.4}

[2025-11-06 18:18:57,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.92 | bwd_microstep: 1.27 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 18:18:59,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.26
[2025-11-06 18:18:59,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.92 | bwd_microstep: 1401.69 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1400.54 | step_microstep: 2.55
[2025-11-06 18:18:59,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.88 | bwd: 1402.97 | bwd_inner: 2.23 | bwd_allreduce: 1400.60 | step: 2.63
40%|████ | 1406/3507 [34:13<55:14, 1.58s/it] {'loss': 0.4892, 'learning_rate': 1.3615709829678122e-05, 'epoch': 0.4}
tensor([[-5.2812, -3.4531, 0.6211, 0.9844, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7500, -1.5938, 0.7695, -3.3594, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.5312, -2.5625, 2.4531, 0.3105, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:18:59,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.47 | bwd_microstep: 1.22 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.7031, -0.6094, 2.5781, -0.8633, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.8906, 0.7930, 2.3125, -1.4844, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4375, -4.0938, 0.1523, -0.7773, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0000, -2.6094, 0.5117, 1.3828, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:18:59,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:18:59,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.13 | bwd_microstep: 1.78 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.76 | step_microstep: 2.07 [2025-11-06 18:18:59,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 442.61 | bwd: 3.00 | bwd_inner: 2.10 | bwd_allreduce: 0.78 | step: 2.15 40%|████ | 1407/3507 [34:13<43:44, 1.25s/it] {'loss': 0.6872, 'learning_rate': 1.360709597753153e-05, 'epoch': 0.4} 40%|████ | 1407/3507 [34:13<43:44, 1.25s/it]tensor([[-0.6016, 1.6562, 3.3906, 1.3672, -0.5820]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:18:59,878] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.81 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.5938, -3.6562, 0.9844, -1.0859, -5.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.0000, -5.3750, -0.9414, 0.2305, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6562, -3.1406, 0.2129, 0.8594, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8125, -2.6250, 0.8125, -0.2451, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2344, -1.7422, 0.9297, 4.2188, -0.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4375, -1.7031, 2.2188, 0.0674, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0312, -3.2500, 0.7773, 1.2344, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:02,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:19:02,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.41 | bwd_microstep: 2686.88 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 2685.66 | step_microstep: 2.09 [2025-11-06 18:19:02,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.25 | bwd: 2687.82 | bwd_inner: 2.00 | bwd_allreduce: 2685.70 | step: 2.17 40%|████ | 1408/3507 [34:16<1:02:51, 1.80s/it] {'loss': 0.3281, 'learning_rate': 1.3598479047558341e-05, 'epoch': 0.4} 40%|████ | 1408/3507 [34:16<1:02:51, 1.80s/it]tensor([[-3.5625, -3.7656, -2.0156, 1.7578, -1.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6562, -3.3594, 0.2754, 1.9141, -2.9219]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:02,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.69 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10 tensor([[-5.9688, -4.6562, -0.5586, 1.2031, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7500, -4.5625, -0.7031, 1.2266, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5938, -0.5352, 3.2188, -3.0000, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4844, -0.1553, 2.2969, -2.4375, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1562, -2.8281, 0.3770, -1.4922, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7500, -3.9062, -0.7383, 1.5938, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:03,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:19:03,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.24 | bwd_microstep: 1.87 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.73 | step_microstep: 1.41 [2025-11-06 18:19:03,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.95 | bwd: 2.77 | bwd_inner: 1.87 | bwd_allreduce: 0.77 | step: 1.51 40%|████ | 1409/3507 [34:17<48:18, 1.38s/it] {'loss': 0.1236, 'learning_rate': 1.3589859047111118e-05, 'epoch': 0.4} 40%|████ | 1409/3507 [34:17<48:18, 1.38s/it]tensor([[-4.7812, -4.5625, -1.4219, 2.6562, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:03,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.58 
| bwd_microstep: 0.92 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.2812, -3.0312, 0.5938, 2.5000, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0000, -3.8438, 0.1147, 2.5000, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4062, -3.3438, 0.0352, 2.1094, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5000, -4.3750, -1.4531, 2.8594, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7188, -3.9688, -0.3066, 2.9219, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2188, 0.9727, 2.7812, -1.6406, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4375, -0.0835, 2.2969, 0.2090, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:04,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 18:19:04,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.34 | bwd_microstep: 831.57 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 830.45 | step_microstep: 2.05 [2025-11-06 18:19:04,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.95 | bwd: 832.49 | bwd_inner: 1.87 | bwd_allreduce: 830.49 | step: 2.12 40%|████ | 1410/3507 [34:18<46:46, 1.34s/it] {'loss': 0.1059, 'learning_rate': 1.358123598354505e-05, 'epoch': 0.4} 40%|████ | 1410/3507 [34:18<46:46, 1.34s/it]tensor([[-3.6562, -1.6094, 1.4531, 0.3281, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -4.3125, -1.7734, 2.2656, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') 
tensor([[-5.3750, -2.3438, 1.0703, -2.5781, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4062, -3.3594, -0.9531, 3.0000, -1.3984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:04,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.16 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.8750, -4.9688, 0.0693, 1.1641, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7500, -3.9844, -0.3066, 2.7188, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1250, -2.4062, 0.2773, 2.8438, -1.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.8438, -5.0000, -1.4297, 1.1953, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:19:06,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.19 | optimizer_step: 0.28 [2025-11-06 18:19:06,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.34 | bwd_microstep: 1479.36 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1478.26 | step_microstep: 2.49 [2025-11-06 18:19:06,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.52 | bwd: 1480.35 | bwd_inner: 1.87 | bwd_allreduce: 1478.31 | step: 2.58 40%|████ | 1411/3507 [34:20<52:07, 1.49s/it] {'loss': 0.4426, 'learning_rate': 1.3572609864217934e-05, 'epoch': 0.4} 40%|████ | 1411/3507 [34:20<52:07, 1.49s/it]tensor([[-2.9688, -2.8125, -0.4727, 3.3281, -1.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:06,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.27 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.74 | 
bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.3125, -5.0000, -0.7188, -1.9062, -5.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0938, -0.4961, 2.7969, 0.8633, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9062, -3.3125, -2.0000, 1.9453, -1.0391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2500, -3.4531, -1.3672, 3.1094, -1.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9375, -5.1562, -1.1562, 2.4688, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.3340, 0.6719, 2.3281, 5.0625, 1.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6562, -0.6523, 2.3438, -1.1172, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:06,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:19:06,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.99 | bwd_microstep: 75.83 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 74.69 | step_microstep: 1.74 [2025-11-06 18:19:06,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 288.28 | bwd: 76.65 | bwd_inner: 1.80 | bwd_allreduce: 74.72 | step: 1.82 40%|████ | 1412/3507 [34:20<40:37, 1.16s/it] {'loss': 0.1024, 'learning_rate': 1.3563980696490184e-05, 'epoch': 0.4} 40%|████ | 1412/3507 [34:20<40:37, 1.16s/it]tensor([[-5.7188, -3.1719, 0.4863, -1.3672, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5625, -3.9219, -1.4062, 0.4863, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.9375, -4.5938, -0.2402, -1.1484, -5.3438]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8906, -2.5938, 0.9570, 2.7656, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2188, -2.0000, 1.7344, 1.0391, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6562, -2.1562, 2.2031, 0.9648, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:07,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 269.38 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.8438, -3.2969, 0.4219, 1.5703, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5312, -1.4141, 2.5000, -0.7656, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:19:09,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.19 | optimizer_step: 0.23 [2025-11-06 18:19:09,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.98 | bwd_microstep: 1859.76 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 1858.78 | step_microstep: 2.60 [2025-11-06 18:19:09,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 450.39 | bwd: 1860.66 | bwd_inner: 1.70 | bwd_allreduce: 1858.82 | step: 2.68 40%|████ | 1413/3507 [34:23<1:00:27, 1.73s/it] {'loss': 0.6201, 'learning_rate': 1.3555348487724805e-05, 'epoch': 0.4} 40%|████ | 1413/3507 [34:23<1:00:27, 1.73s/it]tensor([[-4.3125, -2.6250, 1.2656, 1.9453, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, -2.8750, 0.5352, 1.5859, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:09,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 193.92 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.5469, 0.3848, 2.9219, 0.0176, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0938, -2.7188, 1.0859, 2.7344, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5469, -2.2656, 0.5859, 1.4219, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0000, -0.6133, 2.5312, -2.1562, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1562, -1.3672, 1.7109, 1.6172, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3438, -2.7812, 1.9453, 0.7109, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:10,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:19:10,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.34 | bwd_microstep: 200.57 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 199.25 | step_microstep: 1.37 [2025-11-06 18:19:10,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.28 | bwd: 201.46 | bwd_inner: 2.03 | bwd_allreduce: 199.29 | step: 1.44 40%|████ | 1414/3507 [34:24<48:55, 1.40s/it] {'loss': 0.3084, 'learning_rate': 1.3546713245287407e-05, 'epoch': 0.4} 40%|████ | 1414/3507 [34:24<48:55, 1.40s/it]tensor([[-3.2656, 0.4121, 3.9375, -0.9297, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8438, -3.3438, 0.4277, 1.4609, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:10,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.14 | bwd_microstep: 0.86 | 
bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.5000, -4.2188, -1.0469, 2.8906, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -2.6875, 1.1875, 2.5781, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8125, -4.2812, -2.4219, 2.1875, -1.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.6953, -1.4844, 0.5938, 3.7656, -0.2432]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0312, -3.0781, -0.0156, 2.0469, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0312, -1.2422, 3.3281, -1.3359, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:10,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:19:10,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.95 | bwd_microstep: 67.35 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 66.22 | step_microstep: 1.51 [2025-11-06 18:19:10,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.12 | bwd: 68.22 | bwd_inner: 1.83 | bwd_allreduce: 66.26 | step: 1.60 40%|████ | 1415/3507 [34:24<38:42, 1.11s/it] {'loss': 0.5054, 'learning_rate': 1.3538074976546174e-05, 'epoch': 0.4} 40%|████ | 1415/3507 [34:24<38:42, 1.11s/it]tensor([[-5.4688, -2.5625, 1.5000, -0.8164, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7812, -3.3125, 0.2773, 1.3750, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5625, -3.7812, -0.6641, 2.1094, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2188, -4.5938, -2.4531, 
2.3438, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2812, -4.3750, -0.7930, 1.7812, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:11,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.29 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.9062, -3.9375, -1.2891, 2.9844, -1.7422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7500, -1.6094, 2.8281, -0.3320, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8750, -4.0312, 0.1030, 0.3203, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:12,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.23 | optimizer_step: 0.19 [2025-11-06 18:19:12,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.22 | bwd_microstep: 760.62 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 759.62 | step_microstep: 2.85 [2025-11-06 18:19:12,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.54 | bwd: 761.49 | bwd_inner: 1.63 | bwd_allreduce: 759.68 | step: 2.95 40%|████ | 1416/3507 [34:26<43:21, 1.24s/it] {'loss': 0.5334, 'learning_rate': 1.3529433688871887e-05, 'epoch': 0.4} 40%|████ | 1416/3507 [34:26<43:21, 1.24s/it]tensor([[-2.9531, -1.4297, 1.6484, 2.0156, -1.8828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:12,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 75.75 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.6719, -3.8750, -1.9062, 2.2031, -1.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-6.1250, -2.4219, 1.7188, -3.0156, -5.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8125, -2.0000, 2.1875, -0.0205, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2188, -3.9062, 0.0315, 1.8125, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7344, 0.5391, 3.2656, -1.3047, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2969, -0.1494, 2.9531, -0.6836, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8594, -0.4766, 2.4062, -2.3594, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:19:14,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.32 | optimizer_step: 0.24 [2025-11-06 18:19:14,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.85 | bwd_microstep: 1349.12 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1347.97 | step_microstep: 2.52 [2025-11-06 18:19:14,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 226.61 | bwd: 1349.97 | bwd_inner: 1.78 | bwd_allreduce: 1348.02 | step: 2.60 40%|████ | 1417/3507 [34:27<48:59, 1.41s/it] {'loss': 0.1875, 'learning_rate': 1.3520789389637898e-05, 'epoch': 0.4} 40%|████ | 1417/3507 [34:27<48:59, 1.41s/it]tensor([[-4.9062, -2.7344, 0.8281, -0.0669, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -4.1562, -1.2109, 2.0312, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5312, -3.8281, 0.7930, 2.2031, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [h264 @ 0x9d49780] mmco: unref short failure tensor([[-4.9062, -2.7188, 1.4609, 1.2578, -3.5000]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:15,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.93 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.7812, -2.5469, 1.2031, 0.4219, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2188, -4.6875, -1.0000, 2.5625, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.1875, -4.2188, 0.6719, 1.4141, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7812, -3.7188, 1.1406, 1.7422, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:19:16,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.21 | optimizer_step: 0.23 [2025-11-06 18:19:16,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.78 | bwd_microstep: 1390.20 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 1389.20 | step_microstep: 2.22 [2025-11-06 18:19:16,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.74 | bwd: 1391.11 | bwd_inner: 1.69 | bwd_allreduce: 1389.25 | step: 2.30 40%|████ | 1418/3507 [34:30<1:00:19, 1.73s/it] {'loss': 0.3155, 'learning_rate': 1.3512142086220128e-05, 'epoch': 0.4} 40%|████ | 1418/3507 [34:30<1:00:19, 1.73s/it]tensor([[-4.5625, -1.1875, 3.1250, -0.4160, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.4375, -0.2715, 2.4531, 1.2969, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:16,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.90 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 
tensor([[-4.1875, -2.4062, 1.2656, 1.8359, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6562, -3.0781, -1.3359, 3.1875, -0.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2188, -2.1875, -1.0078, 1.5469, -0.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.0781, 0.2217, 2.6250, -1.8125, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.0625, -2.0625, 0.8320, 0.1060, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5938, -2.1250, 2.5625, -1.2344, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:19:17,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:19:17,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.62 | bwd_microstep: 181.17 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 180.22 | step_microstep: 1.49 [2025-11-06 18:19:17,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 277.52 | bwd: 182.02 | bwd_inner: 1.65 | bwd_allreduce: 180.25 | step: 1.57 40%|████ | 1419/3507 [34:30<47:19, 1.36s/it] {'loss': 0.5079, 'learning_rate': 1.3503491785997053e-05, 'epoch': 0.4} 40%|████ | 1419/3507 [34:30<47:19, 1.36s/it]tensor([[-4.0625, -0.3242, 3.8906, -1.2344, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5312, -3.4531, -0.5820, 3.6562, -1.4297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4688, -0.0952, 3.1562, -0.9883, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:17,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.30 | 
bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-6.2812, -5.3750, -1.0781, 2.1406, -3.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1562, -1.5703, 2.7969, -1.9219, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6875, -2.7656, 0.4668, 2.8438, -1.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1094, -2.9375, -1.3125, 1.3125, -1.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6562, -1.8516, 2.7344, 0.7773, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:19,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.20 | optimizer_step: 0.30 [2025-11-06 18:19:19,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.83 | bwd_microstep: 1727.23 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 1725.97 | step_microstep: 2.94 [2025-11-06 18:19:19,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.15 | bwd: 1728.15 | bwd_inner: 1.97 | bwd_allreduce: 1726.02 | step: 3.03 40%|████ | 1420/3507 [34:33<55:16, 1.59s/it] {'loss': 0.0692, 'learning_rate': 1.3494838496349729e-05, 'epoch': 0.4} 40%|████ | 1420/3507 [34:33<55:16, 1.59s/it]tensor([[-3.9531, -3.1094, -0.1367, 1.7422, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.0781, -2.3125, -1.3672, 1.5625, -0.5977]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:19:19,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.75 | bwd_microstep: 4.19 | bwd_inner_microstep: 4.09 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.3438, -2.0312, 1.6094, 0.3965, -3.3125]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-8.0000, -6.2500, -0.9883, 0.4160, -5.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0938, -3.7344, -0.7930, 2.7188, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9531, -3.1406, -1.3750, 2.5625, -1.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -3.8750, -0.0425, 2.3438, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5938, -3.9375, -0.8945, 1.7344, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:19:19,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.24 | optimizer_step: 0.26 [2025-11-06 18:19:19,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.79 | bwd_microstep: 53.16 | bwd_inner_microstep: 2.01 | bwd_allreduce_microstep: 51.01 | step_microstep: 2.34 [2025-11-06 18:19:19,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.56 | bwd: 57.35 | bwd_inner: 6.12 | bwd_allreduce: 51.05 | step: 2.42 41%|████ | 1421/3507 [34:33<43:41, 1.26s/it] {'loss': 0.8875, 'learning_rate': 1.3486182224661732e-05, 'epoch': 0.41} 41%|████ | 1421/3507 [34:33<43:41, 1.26s/it]tensor([[-4.8438, -3.6719, 0.0603, 2.0000, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3125, -3.6250, -1.1719, 3.7500, -1.1484]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0625, -5.6250, -2.2656, 1.2344, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.7500, -5.5312, -0.0601, 0.5117, -5.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6875, -1.3672, 1.6250, 
-0.3730, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -3.0469, 0.8398, 1.4531, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:20,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.90 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0625, -3.3281, -0.4746, 1.9688, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.3438, -0.7344, 3.0938, -1.8438, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:23,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.11 | optimizer_gradients: 0.21 | optimizer_step: 0.18 [2025-11-06 18:19:23,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.80 | bwd_microstep: 3047.03 | bwd_inner_microstep: 4.63 | bwd_allreduce_microstep: 3042.30 | step_microstep: 3.45 [2025-11-06 18:19:23,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.72 | bwd: 3047.74 | bwd_inner: 5.25 | bwd_allreduce: 3042.34 | step: 3.54 41%|████ | 1422/3507 [34:37<1:10:53, 2.04s/it] {'loss': 0.1765, 'learning_rate': 1.3477522978319208e-05, 'epoch': 0.41} 41%|████ | 1422/3507 [34:37<1:10:53, 2.04s/it]tensor([[-1.4766, 1.0625, 2.4219, -0.6523, -1.6016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.1875, -0.7539, 2.4531, -2.2969, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.9062, -1.7188, 1.8594, -2.0781, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:23,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.41 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | 
step_microstep: 0.08 tensor([[-3.6094, -1.4453, 2.2812, 1.8047, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.6875, -4.0312, 1.4219, 0.5195, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0469, -0.2314, 1.4375, -2.2500, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.2500, -4.0625, -1.3984, 2.2500, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4062, -3.8281, -2.1562, 2.2344, -1.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:19:24,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:19:24,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.49 | bwd_microstep: 29.15 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 28.00 | step_microstep: 1.86 [2025-11-06 18:19:24,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.92 | bwd: 30.04 | bwd_inner: 1.84 | bwd_allreduce: 28.06 | step: 1.93 41%|████ | 1423/3507 [34:37<53:56, 1.55s/it] {'loss': 0.9716, 'learning_rate': 1.3468860764710835e-05, 'epoch': 0.41} 41%|████ | 1423/3507 [34:37<53:56, 1.55s/it]tensor([[-2.2969, 1.2188, 3.4844, -2.1406, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1250, -4.9062, -0.2344, 2.5469, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6875, -3.9531, -0.2949, 2.8438, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:24,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.57 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0312, 
-1.9453, 1.6328, 1.1719, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0312, -5.0938, -2.3906, 1.8594, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9375, -5.3125, -1.3438, 2.3750, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3438, -3.6406, 0.4766, 1.4141, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1562, -1.0859, 3.1250, 0.0605, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:26,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:19:26,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.75 | bwd_microstep: 1665.09 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1663.90 | step_microstep: 1.77 [2025-11-06 18:19:26,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.35 | bwd: 1665.82 | bwd_inner: 1.74 | bwd_allreduce: 1663.95 | step: 1.86 41%|████ | 1424/3507 [34:39<58:58, 1.70s/it] {'loss': 0.5538, 'learning_rate': 1.3460195591227806e-05, 'epoch': 0.41} 41%|████ | 1424/3507 [34:39<58:58, 1.70s/it]tensor([[-5.0000, -4.3750, -0.9609, 2.3125, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9219, -1.5859, 1.9062, 0.8477, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4375, -2.0156, 2.5781, -1.1484, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6562, -2.1250, 1.7578, -0.0183, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:26,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 296.03 | bwd_microstep: 0.89 | 
bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.6875, -4.4688, 0.7461, 0.8359, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -2.1406, 1.4375, -0.4785, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7812, -1.8984, 2.3906, -2.9375, -5.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7188, -0.7852, 1.7812, -2.0938, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:26,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 18:19:26,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.52 | bwd_microstep: 2.35 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 1.08 | step_microstep: 2.30 [2025-11-06 18:19:26,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 498.62 | bwd: 3.24 | bwd_inner: 1.96 | bwd_allreduce: 1.11 | step: 2.37 41%|████ | 1425/3507 [34:40<46:57, 1.35s/it] {'loss': 0.6105, 'learning_rate': 1.3451527465263867e-05, 'epoch': 0.41} 41%|████ | 1425/3507 [34:40<46:57, 1.35s/it]tensor([[-4.5000, -1.3516, 2.2969, -1.3516, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.1250, -3.5469, -0.0447, 3.6719, -1.9922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -3.2031, 1.1406, 3.2031, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:26,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.27 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.0000, -2.0312, 2.2500, 2.3594, -2.6875]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4375, -4.5000, -1.7656, 2.7344, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5312, -2.7969, 0.3652, 3.1562, -1.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2188, -1.3828, 2.2812, -0.5938, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0625, -4.6562, -0.4961, 1.4453, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:19:29,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.18 | optimizer_step: 0.26 [2025-11-06 18:19:29,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.65 | bwd_microstep: 2511.81 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 2510.72 | step_microstep: 2.43 [2025-11-06 18:19:29,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.94 | bwd: 2512.64 | bwd_inner: 1.76 | bwd_allreduce: 2510.76 | step: 2.50 41%|████ | 1426/3507 [34:43<1:03:05, 1.82s/it] {'loss': 0.6027, 'learning_rate': 1.3442856394215262e-05, 'epoch': 0.41} 41%|████ | 1426/3507 [34:43<1:03:05, 1.82s/it]tensor([[-5.4062, -1.7734, 2.7656, -1.5547, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5938, -3.9688, -2.0469, 2.6250, -1.4609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6875, -2.8906, 0.7188, 0.9727, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2344, -0.4219, 2.6719, -0.3926, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:19:29,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.88 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.03 | 
step_microstep: 0.07 tensor([[-3.9531, -0.2715, 2.6094, -2.9062, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.9531, -3.0938, 0.3496, 3.0781, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -2.0625, 2.2656, 1.8750, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9688, -3.3906, -0.4375, 2.3438, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:30,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:19:30,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.40 | bwd_microstep: 42.74 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 41.65 | step_microstep: 4.47 [2025-11-06 18:19:30,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 425.31 | bwd: 43.49 | bwd_inner: 1.67 | bwd_allreduce: 41.68 | step: 4.54 41%|████ | 1427/3507 [34:43<49:26, 1.43s/it] {'loss': 1.3598, 'learning_rate': 1.3434182385480756e-05, 'epoch': 0.41} 41%|████ | 1427/3507 [34:43<49:26, 1.43s/it]tensor([[-4.5312, -4.6250, -1.8359, 2.7812, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:30,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.08 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.8750, -3.9844, -0.5820, 1.8672, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0000, -3.5781, 0.1318, 1.4922, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4531, 0.1846, 3.3125, -2.3125, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, 
-2.6250, 0.8984, -0.8320, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2188, -2.0312, 2.5312, -0.5977, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.7656, -1.5312, 1.9609, 0.7891, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6875, -1.8047, 2.6719, 0.5117, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:31,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.17 | optimizer_step: 0.23 [2025-11-06 18:19:31,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.47 | bwd_microstep: 1075.21 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1074.15 | step_microstep: 3.15 [2025-11-06 18:19:31,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 277.56 | bwd: 1075.91 | bwd_inner: 1.58 | bwd_allreduce: 1074.19 | step: 3.23 41%|████ | 1428/3507 [34:45<49:03, 1.42s/it] {'loss': 0.1341, 'learning_rate': 1.3425505446461625e-05, 'epoch': 0.41} 41%|████ | 1428/3507 [34:45<49:03, 1.42s/it]tensor([[-4.8750, -1.3203, 2.5469, -1.7188, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1562, -3.6562, 1.4062, 0.5469, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:31,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.18 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.3750, -1.2422, 2.2969, -1.3984, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4375, -4.7188, 0.1270, 1.3359, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1562, -3.4375, 0.8672, -1.2969, -5.0000]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.8438, -3.6875, -0.0238, 1.8594, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5781, -2.5625, -1.1328, 1.7578, -1.0078]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-3.9844, -2.4531, 1.0000, 2.1094, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:19:32,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:19:32,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.67 | bwd_microstep: 708.08 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 706.98 | step_microstep: 1.75 [2025-11-06 18:19:32,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 497.88 | bwd: 708.93 | bwd_inner: 1.79 | bwd_allreduce: 707.01 | step: 1.82 41%|████ | 1429/3507 [34:46<47:18, 1.37s/it] {'loss': 0.5252, 'learning_rate': 1.3416825584561632e-05, 'epoch': 0.41} 41%|████ | 1429/3507 [34:46<47:18, 1.37s/it]tensor([[-6.8438, -5.6875, -0.8672, 2.1406, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:32,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.71 | bwd_microstep: 3.41 | bwd_inner_microstep: 3.30 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-2.2344, 0.7695, 3.9062, 0.5547, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4688, -2.8281, -0.1924, 2.3125, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2812, -2.7500, 1.5000, 0.2949, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6875, 0.7070, 3.8438, -0.7617, -2.7812]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -4.0312, -0.4785, 3.1719, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5938, -0.6953, 3.6406, -1.6172, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [h264 @ 0xc343300] mmco: unref short failure tensor([[-5.0625, -1.6328, 2.8281, -1.1562, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:34,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.18 | optimizer_step: 0.16 [2025-11-06 18:19:34,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.29 | bwd_microstep: 1013.63 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 1012.64 | step_microstep: 1.96 [2025-11-06 18:19:34,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.04 | bwd: 1017.04 | bwd_inner: 4.21 | bwd_allreduce: 1012.69 | step: 2.06 41%|████ | 1430/3507 [34:47<47:42, 1.38s/it] {'loss': 0.2314, 'learning_rate': 1.3408142807187048e-05, 'epoch': 0.41} 41%|████ | 1430/3507 [34:47<47:42, 1.38s/it]tensor([[-3.0000, 0.7656, 3.1875, -2.6406, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[3.8594, 4.0312, 5.1250, 8.0625, 4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4531, -0.6875, 1.9531, -0.8242, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:34,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.43 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.2188, -2.9375, 1.2812, 0.5742, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7500, -1.5547, 1.4375, 2.9062, -1.4531]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0000, -2.9219, 0.6523, 2.9219, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.6562, 1.1953, 3.1562, -0.3047, -1.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6875, 0.2432, 4.2500, -1.5078, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:34,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:19:34,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.10 | bwd_microstep: 241.98 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 241.03 | step_microstep: 2.03 [2025-11-06 18:19:34,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.55 | bwd: 242.91 | bwd_inner: 1.73 | bwd_allreduce: 241.07 | step: 2.10 41%|████ | 1431/3507 [34:48<40:03, 1.16s/it] {'loss': 0.2478, 'learning_rate': 1.3399457121746626e-05, 'epoch': 0.41} 41%|████ | 1431/3507 [34:48<40:03, 1.16s/it]tensor([[-3.9688, -3.9844, -1.6094, 2.1250, -1.9609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5781, -1.5312, 1.3281, 2.9844, -1.2578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -3.0156, 0.8320, 1.8203, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:34,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.57 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0938, -4.7500, -1.7266, 1.7266, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8750, -3.0469, 1.6250, -0.1846, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:1') tensor([[-5.9688, -3.3125, 1.5234, 0.1426, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2500, -0.9766, 3.0625, -0.4199, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.3750, -2.9688, -0.3652, 2.8906, -1.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:19:37,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:19:37,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.67 | bwd_microstep: 2122.42 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 2121.45 | step_microstep: 1.79 [2025-11-06 18:19:37,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.25 | bwd: 2123.33 | bwd_inner: 1.70 | bwd_allreduce: 2121.50 | step: 1.87 41%|████ | 1432/3507 [34:50<53:45, 1.55s/it] {'loss': 0.5188, 'learning_rate': 1.3390768535651598e-05, 'epoch': 0.41} 41%|████ | 1432/3507 [34:51<53:45, 1.55s/it]tensor([[-3.5312, -3.5000, -1.4453, 2.0469, -1.6172]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7656, -2.5938, 0.8320, 2.7031, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2188, -3.8438, -0.8711, 2.7031, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7188, -3.6719, -1.7266, 1.3828, -1.8828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:37,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.10 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.6250, -0.6367, 2.4375, -1.3516, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') 
tensor([[-2.9531, 0.4473, 3.6719, -0.2578, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7188, -2.6406, 2.4219, 0.2119, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -0.4844, 2.9531, -2.3125, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:37,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:19:37,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 334.78 | bwd_microstep: 1.98 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.02 [2025-11-06 18:19:37,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 485.90 | bwd: 3.04 | bwd_inner: 2.02 | bwd_allreduce: 0.87 | step: 2.10 41%|████ | 1433/3507 [34:51<44:39, 1.29s/it] {'loss': 0.3128, 'learning_rate': 1.3382077056315672e-05, 'epoch': 0.41} 41%|████ | 1433/3507 [34:51<44:39, 1.29s/it]tensor([[-4.7188, -4.3438, -0.7383, 3.5156, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -2.9375, 0.7461, 2.0938, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0938, -1.5938, 1.6484, -0.2715, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6250, -3.1094, 0.3242, 1.3281, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:38,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.06 | bwd_microstep: 1.32 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.6250, -3.2031, 0.5078, 1.7656, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[0.6602, 2.4531, 4.2812, 3.2344, 0.7188]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.1562, -2.6875, 0.4863, 1.0469, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1562, -4.1250, -0.7656, 1.0234, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:19:38,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:19:38,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.86 | bwd_microstep: 329.42 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 328.35 | step_microstep: 1.63 [2025-11-06 18:19:38,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.95 | bwd: 330.75 | bwd_inner: 2.22 | bwd_allreduce: 328.39 | step: 1.71 41%|████ | 1434/3507 [34:52<38:32, 1.12s/it] {'loss': 0.9036, 'learning_rate': 1.3373382691155035e-05, 'epoch': 0.41} 41%|████ | 1434/3507 [34:52<38:32, 1.12s/it]tensor([[-4.3125, -0.9648, 2.6094, -1.3594, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9531, -2.4688, 1.2734, 2.4219, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6562, -3.3594, -0.4102, 3.2344, -1.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3750, -3.1250, 0.4785, 2.2344, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.7500, -5.9375, -2.2188, 0.7695, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:39,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.02 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.2812, -4.0312, -1.1953, 2.6250, -2.1562]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2969, -3.5000, -0.8438, 4.0625, -1.0859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7969, -0.8750, 3.0469, 0.3867, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:19:40,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 18:19:40,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.39 | bwd_microstep: 176.62 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 175.68 | step_microstep: 2.23 [2025-11-06 18:19:40,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.45 | bwd: 177.35 | bwd_inner: 1.45 | bwd_allreduce: 175.73 | step: 2.32 41%|████ | 1435/3507 [34:54<44:56, 1.30s/it] {'loss': 0.3115, 'learning_rate': 1.3364685447588315e-05, 'epoch': 0.41} 41%|████ | 1435/3507 [34:54<44:56, 1.30s/it]tensor([[-1.8984, 0.5352, 2.3906, -0.4336, -1.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3125, -2.9844, 0.7305, 2.5000, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.2812, -5.3438, -1.7734, 0.4727, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8438, -3.6406, 1.0547, 0.9805, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:40,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.27 | bwd_microstep: 1.30 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-6.3125, -6.4688, -3.4844, 1.5938, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5938, -3.8906, -2.2500, 1.7734, -1.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-3.5781, -0.3125, 2.4844, -2.0000, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.8125, -1.6406, 2.5625, -0.5586, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:19:42,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.19 | optimizer_step: 0.22 [2025-11-06 18:19:42,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.43 | bwd_microstep: 2197.78 | bwd_inner_microstep: 6.92 | bwd_allreduce_microstep: 2190.76 | step_microstep: 2.61 [2025-11-06 18:19:42,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.74 | bwd: 2199.07 | bwd_inner: 8.07 | bwd_allreduce: 2190.82 | step: 2.72 41%|████ | 1436/3507 [34:56<58:32, 1.70s/it] {'loss': 0.6824, 'learning_rate': 1.335598533303662e-05, 'epoch': 0.41} 41%|████ | 1436/3507 [34:56<58:32, 1.70s/it]tensor([[-2.6562, -2.8281, -0.9648, 2.9688, -0.8203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5156, -3.1875, -0.0703, 3.7344, -1.4922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.5781, 2.9062, 5.0938, -0.2715, -1.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0938, -3.7344, 0.1006, 1.4688, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -2.0469, 2.6250, 0.4102, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:43,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.65 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.1562, -4.5938, -1.3906, 1.7109, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7188, -1.6172, 1.9297, 
1.1719, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4688, -5.0312, -1.4922, 2.3125, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:19:43,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:19:43,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.67 | bwd_microstep: 140.77 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 139.82 | step_microstep: 2.29 [2025-11-06 18:19:43,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.34 | bwd: 141.57 | bwd_inner: 1.51 | bwd_allreduce: 139.89 | step: 2.38 41%|████ | 1437/3507 [34:57<51:35, 1.50s/it] {'loss': 0.3012, 'learning_rate': 1.3347282354923486e-05, 'epoch': 0.41} 41%|████ | 1437/3507 [34:57<51:35, 1.50s/it]tensor([[-4.0938, -4.2188, -1.7109, 2.6719, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, -4.4375, -1.7500, 2.7656, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5000, -1.7969, 1.7031, -0.5039, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8594, -3.5156, -0.4395, 3.2969, -1.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:19:44,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.54 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.5000, -6.3750, -3.4688, 0.2119, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0625, -0.6406, 2.9219, 1.7734, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7500, -2.8750, 0.9062, 1.1875, -3.3281]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.3438, -1.9375, 2.1875, 1.0703, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:19:46,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.21 | optimizer_step: 0.23 [2025-11-06 18:19:46,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.88 | bwd_microstep: 2528.95 | bwd_inner_microstep: 7.78 | bwd_allreduce_microstep: 2521.01 | step_microstep: 7.26 [2025-11-06 18:19:46,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.47 | bwd: 2529.76 | bwd_inner: 8.53 | bwd_allreduce: 2521.04 | step: 7.34 41%|████ | 1438/3507 [35:00<1:06:30, 1.93s/it] {'loss': 0.8136, 'learning_rate': 1.333857652067491e-05, 'epoch': 0.41} 41%|████ | 1438/3507 [35:00<1:06:30, 1.93s/it]tensor([[-4.0000, -2.5156, 0.6250, 1.2734, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:19:47,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.21 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-6.2500, -4.8125, -0.5742, 0.8359, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.3125, -3.1875, 1.7266, -0.8867, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4688, -4.0000, 0.4238, 2.2031, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-8.1250, -5.8125, -0.7344, -0.7734, -6.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-7.0938, -7.3125, -3.8906, 1.3906, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([3], device='cuda:3') tensor([[-2.5156, -0.0327, 2.2969, -0.1416, -2.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([1], device='cuda:2')
tensor([[-3.5312, -3.1406, -0.7891, 1.8672, -1.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:19:47,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.20 | optimizer_step: 0.18
[2025-11-06 18:19:47,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.56 | bwd_microstep: 492.54 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 491.57 | step_microstep: 1.93
[2025-11-06 18:19:47,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.80 | bwd: 493.25 | bwd_inner: 1.47 | bwd_allreduce: 491.61 | step: 2.01
41%|████ | 1439/3507 [35:01<55:48, 1.62s/it] {'loss': 0.609, 'learning_rate': 1.332986783771932e-05, 'epoch': 0.41}
41%|████ | 1439/3507 [35:01<55:48, 1.62s/it]
tensor([[-3.6094, -3.9844, -2.2031, 2.1250, -1.4922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5625, -2.5000, 1.3125, 0.8398, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:19:47,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.55 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.2109, -1.7578, -1.1406, 2.3594, 0.2178]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4375, -3.9219, -0.5078, 3.0781, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -2.7344, 1.2891, 1.7500, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3438, -4.4062, -1.8203, 2.2500, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.1875, -0.4590, 2.9688, 0.6992, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2188, -2.3750, 0.1064, -1.1719, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:19:48,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.74 | optimizer_gradients: 0.19 | optimizer_step: 0.17
[2025-11-06 18:19:48,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.01 | bwd_microstep: 819.89 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 819.08 | step_microstep: 2.49
[2025-11-06 18:19:48,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.59 | bwd: 820.57 | bwd_inner: 1.29 | bwd_allreduce: 819.12 | step: 2.57
41%|████ | 1440/3507 [35:02<51:37, 1.50s/it] {'loss': 0.3369, 'learning_rate': 1.3321156313487565e-05, 'epoch': 0.41}
41%|████ | 1440/3507 [35:02<51:37, 1.50s/it]
tensor([[-6.0938, -3.7500, 1.1719, 0.8906, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[h264 @ 0x87c61c0] mmco: unref short failure
tensor([[-1.5547, -1.7969, -0.7070, 2.6406, -0.0718]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:19:49,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.58 | bwd_microstep: 3.49 | bwd_inner_microstep: 3.36 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.9531, -1.3203, 2.6250, 0.8945, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.7500, -3.7500, 0.9766, 1.2734, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.2500, -1.9062, 1.3125, 0.0315, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.2812, -1.1328, 3.4688, -2.2500, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.6250, -1.7891, 3.0938, -1.6484, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2],
device='cuda:2')
tensor([[-3.6719, -3.1562, -0.4453, 2.4844, -1.8828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:19:51,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.73 | optimizer_gradients: 0.18 | optimizer_step: 0.22
[2025-11-06 18:19:51,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.31 | bwd_microstep: 2377.17 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 2376.11 | step_microstep: 2.48
[2025-11-06 18:19:51,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.92 | bwd: 2380.67 | bwd_inner: 4.34 | bwd_allreduce: 2376.17 | step: 2.57
41%|████ | 1441/3507 [35:05<1:06:31, 1.93s/it] {'loss': 0.2651, 'learning_rate': 1.331244195541293e-05, 'epoch': 0.41}
41%|████ | 1441/3507 [35:05<1:06:31, 1.93s/it]
tensor([[-4.9688, -1.1094, 3.1250, -2.3125, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.1406, -0.2617, 2.8125, -0.5508, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:19:52,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.95 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.8594, -3.7969, -0.9062, 3.2344, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.8438, -3.5312, 1.2109, -1.8984, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.3125, -5.5938, -3.3906, 0.9648, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.6562, -3.2031, 1.8906, 1.2266, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-8.1875, -5.2812, -1.8125, -4.6875, -6.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-6.1875, -3.1094, 2.2344, 0.1699, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:19:53,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.15 | optimizer_step: 0.19
[2025-11-06 18:19:53,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.13 | bwd_microstep: 749.57 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 748.60 | step_microstep: 2.18
[2025-11-06 18:19:53,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.09 | bwd: 750.58 | bwd_inner: 1.78 | bwd_allreduce: 748.64 | step: 2.27
41%|████ | 1442/3507 [35:06<57:58, 1.68s/it] {'loss': 0.6238, 'learning_rate': 1.3303724770931123e-05, 'epoch': 0.41}
41%|████ | 1442/3507 [35:06<57:58, 1.68s/it]
tensor([[-0.0330, 1.8984, 2.4688, -0.1050, -0.3613]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:19:53,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.80 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.3906, -2.5312, 0.8828, 3.6406, -1.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3125, -3.7500, -0.4023, 2.9219, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3438, -0.8047, 2.5781, -2.2500, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0938, -3.3125, 0.6445, 0.7734, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.5625, -0.0466, 2.9375, 0.7930, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.9688, 0.5820, 3.9531, -1.0859, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.5312, -2.6875, 1.8516, -0.2617,
-4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:19:53,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:19:53,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 252.98 | bwd_microstep: 2.22 | bwd_inner_microstep: 1.45 | bwd_allreduce_microstep: 0.69 | step_microstep: 1.34
[2025-11-06 18:19:53,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.81 | bwd: 3.10 | bwd_inner: 2.25 | bwd_allreduce: 0.73 | step: 1.42
41%|████ | 1443/3507 [35:07<44:54, 1.31s/it] {'loss': 0.3016, 'learning_rate': 1.3295004767480246e-05, 'epoch': 0.41}
41%|████ | 1443/3507 [35:07<44:54, 1.31s/it]
tensor([[-5.8125, -3.2031, 1.6953, 0.3809, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6562, -3.6719, -1.4141, 2.3125, -1.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:19:53,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.29 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.9688, -3.0781, 0.6758, 0.7070, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.2812, -6.4375, -2.0000, 1.3359, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8750, -3.2500, 0.7695, 1.8359, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.7500, -3.9531, 0.2178, 0.7031, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4375, -2.1875, 1.6094, 0.2617, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.5156, 0.2402, 2.2188, -1.2969, -2.4844]], device='cuda:2', dtype=torch.bfloat16,
grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:19:56,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.21
[2025-11-06 18:19:56,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.15 | bwd_microstep: 2495.80 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 2494.64 | step_microstep: 2.04
[2025-11-06 18:19:56,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 345.47 | bwd: 2496.70 | bwd_inner: 1.87 | bwd_allreduce: 2494.69 | step: 2.14
41%|████ | 1444/3507 [35:10<1:01:05, 1.78s/it] {'loss': 0.4194, 'learning_rate': 1.328628195250082e-05, 'epoch': 0.41}
41%|████ | 1444/3507 [35:10<1:01:05, 1.78s/it]
tensor([[-2.8906, -2.8594, -1.0781, 2.1250, -1.1797]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-4.2812, -4.5000, -2.3750, 1.5938, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:19:56,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.84 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.3750, -1.6797, 2.4062, 0.2812, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.6875, -4.1250, -2.8750, 0.9570, -1.6641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5625, -1.8828, 2.1875, -2.6094, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.3125, -2.4531, 2.3281, 0.4414, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[ 0.5117, 3.3594, 4.8750, 0.8320, -0.1328]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.2969, 0.4922, 2.1719, -1.3125, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1],
device='cuda:2')
[2025-11-06 18:19:56,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.04 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:19:56,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.08 | bwd_microstep: 132.87 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 132.02 | step_microstep: 3.19
[2025-11-06 18:19:56,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.95 | bwd: 133.79 | bwd_inner: 1.57 | bwd_allreduce: 132.07 | step: 3.27
41%|████ | 1445/3507 [35:10<47:57, 1.40s/it] {'loss': 0.7472, 'learning_rate': 1.3277556333435757e-05, 'epoch': 0.41}
41%|████ | 1445/3507 [35:10<47:57, 1.40s/it]
tensor([[-3.8750, -2.6719, 0.6016, 2.0000, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4844, -3.5781, -0.8203, 3.8125, -1.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2344, -0.5156, 2.1406, -0.7695, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.1875, -4.8438, -0.9883, 0.1797, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:19:57,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.56 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-3.1562, 0.0425, 3.0938, -0.8164, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0938, -3.5000, 0.4473, 1.4844, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.9844, -1.8984, 0.3164, -1.6719, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-7.0312, -4.3750, -0.6328, -2.9375, -5.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06
18:19:59,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:19:59,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.31 | bwd_microstep: 2208.85 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 2207.79 | step_microstep: 2.18
[2025-11-06 18:19:59,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 426.91 | bwd: 2209.61 | bwd_inner: 1.65 | bwd_allreduce: 2207.82 | step: 2.25
41%|████ | 1446/3507 [35:13<1:01:10, 1.78s/it] {'loss': 0.3536, 'learning_rate': 1.3268827917730374e-05, 'epoch': 0.41}
41%|████ | 1446/3507 [35:13<1:01:10, 1.78s/it]
tensor([[-4.9688, -4.1250, -0.8633, 1.2109, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.1719, 1.8203, 3.0156, -1.6641, -1.7266]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.8594, -1.9766, 1.1016, 0.2041, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5312, -2.2031, 1.8438, 0.8320, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:19:59,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.49 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-3.9062, -1.2891, 2.0781, -0.0889, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5312, -3.6875, -1.9062, 1.8359, -1.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.6562, 1.6797, 3.3750, -2.0000, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.2812, -3.4219, 0.1943, 3.1094, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:20:00,063] [INFO]
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.21 | optimizer_step: 0.20
[2025-11-06 18:20:00,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.66 | bwd_microstep: 2.12 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.94 | step_microstep: 2.06
[2025-11-06 18:20:00,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 486.18 | bwd: 3.17 | bwd_inner: 1.99 | bwd_allreduce: 0.99 | step: 2.16
41%|████▏ | 1447/3507 [35:13<48:20, 1.41s/it] {'loss': 0.9572, 'learning_rate': 1.3260096712832355e-05, 'epoch': 0.41}
41%|████▏ | 1447/3507 [35:13<48:20, 1.41s/it]
tensor([[-5.1562, -2.9375, 1.7422, 1.5234, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:20:00,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.87 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-6.2812, -5.6562, -2.6406, -0.1396, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.9219, 0.6289, 1.4375, -2.0156, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-1.0078, -1.8281, -0.3789, 5.1562, 0.8672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5938, -1.9922, 1.9844, 0.4121, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-0.6602, 1.8672, 3.3125, -0.0708, -0.9570]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.2500, -2.8281, -0.3203, 2.4375, -1.5703]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.1875, -4.9688, 0.1963, 0.1934, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:20:01,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) |
optimizer_allgather: 0.44 | optimizer_gradients: 0.16 | optimizer_step: 0.19
[2025-11-06 18:20:01,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.91 | bwd_microstep: 1161.78 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 1160.78 | step_microstep: 1.98
[2025-11-06 18:20:01,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.79 | bwd: 1162.55 | bwd_inner: 1.62 | bwd_allreduce: 1160.82 | step: 2.05
41%|████▏ | 1448/3507 [35:15<49:15, 1.44s/it] {'loss': 0.5834, 'learning_rate': 1.3251362726191784e-05, 'epoch': 0.41}
41%|████▏ | 1448/3507 [35:15<49:15, 1.44s/it]
tensor([[-2.0781, -2.5625, -1.2578, 2.8906, -0.2930]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:20:01,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.30 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-4.5938, -3.5000, -0.3086, 1.2344, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4688, -3.6094, 0.0400, 2.7500, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5312, -3.5312, -0.0874, 1.7656, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9688, -4.4375, -1.2891, 1.8359, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4688, -4.5625, -0.5859, 2.3594, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.8750, -4.3125, 0.8867, 0.3184, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0625, -3.5312, -0.0085, 0.3867, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:20:06,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 |
optimizer_gradients: 0.17 | optimizer_step: 0.22
[2025-11-06 18:20:06,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.08 | bwd_microstep: 4211.36 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 4210.04 | step_microstep: 2.48
[2025-11-06 18:20:06,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.41 | bwd: 4212.39 | bwd_inner: 2.15 | bwd_allreduce: 4210.10 | step: 2.58
41%|████▏ | 1449/3507 [35:19<1:21:19, 2.37s/it] {'loss': 0.6501, 'learning_rate': 1.3242625965261102e-05, 'epoch': 0.41}
41%|████▏ | 1449/3507 [35:19<1:21:19, 2.37s/it]
tensor([[-4.7812, -4.5625, -1.7891, 1.7734, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.7812, -4.2812, 1.0469, 0.6367, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:20:06,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.07 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06
tensor([[-4.8438, -3.0781, 0.6680, 0.8008, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.5625, -3.0625, 1.6016, 0.6953, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1250, -3.2344, 0.4863, 0.2734, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5312, -0.5312, 3.4375, -2.5000, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.0156, -2.0000, -0.6289, 2.4375, -0.4785]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
tensor([[-5.4688, -2.0156, 2.9219, -0.6172, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:20:06,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.15 |
optimizer_step: 0.15
[2025-11-06 18:20:06,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 252.95 | bwd_microstep: 42.33 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 41.46 | step_microstep: 1.54
[2025-11-06 18:20:06,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 440.04 | bwd: 43.08 | bwd_inner: 1.48 | bwd_allreduce: 41.48 | step: 1.61
41%|████▏ | 1450/3507 [35:20<1:02:16, 1.82s/it] {'loss': 1.308, 'learning_rate': 1.3233886437495132e-05, 'epoch': 0.41}
41%|████▏ | 1450/3507 [35:20<1:02:16, 1.82s/it]
tensor([[-5.0625, -4.7812, -1.5859, 2.2031, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7188, -3.2656, 0.4336, 1.6484, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.4375, -4.8125, 0.2754, 2.0312, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:20:06,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.85 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.3125, -4.5625, -2.7812, 0.9766, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.0312, -3.8281, -1.0703, 2.5156, -1.9922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4688, -3.6875, 0.1826, 0.4355, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.6250, -3.7969, 1.3516, -0.3457, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0938, -4.3125, -0.9141, 1.6406, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:20:08,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06
18:20:08,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.70 | bwd_microstep: 1285.67 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 1284.63 | step_microstep: 2.00
[2025-11-06 18:20:08,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.57 | bwd: 1286.46 | bwd_inner: 1.66 | bwd_allreduce: 1284.66 | step: 2.07
41%|████▏ | 1451/3507 [35:22<1:00:39, 1.77s/it] {'loss': 0.5441, 'learning_rate': 1.3225144150351042e-05, 'epoch': 0.41}
41%|████▏ | 1451/3507 [35:22<1:00:39, 1.77s/it]
tensor([[-5.8438, -3.0781, 2.0156, 0.8594, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.7500, -3.8438, -0.2988, 1.9531, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8125, -4.0312, -1.6562, 2.6719, -1.6484]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2500, -3.2188, 0.3184, 2.4844, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:20:08,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.49 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.3750, -0.8711, 3.0156, -1.3594, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.7656, -3.8125, -1.5703, 2.3438, -1.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.6250, -2.3125, 2.2969, -1.2266, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.1250, -2.6094, 2.1406, -1.8594, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:20:08,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:20:08,722] [INFO]
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.76 | bwd_microstep: 34.75 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 33.60 | step_microstep: 1.90
[2025-11-06 18:20:08,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.27 | bwd: 35.57 | bwd_inner: 1.81 | bwd_allreduce: 33.64 | step: 1.97
41%|████▏ | 1452/3507 [35:22<46:44, 1.36s/it] {'loss': 0.5711, 'learning_rate': 1.3216399111288372e-05, 'epoch': 0.41}
41%|████▏ | 1452/3507 [35:22<46:44, 1.36s/it]
tensor([[-6.9688, -4.6562, 0.2910, 0.2266, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:20:08,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.44 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[0.2324, 1.3438, 3.9375, 5.5625, 1.0547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3438, -3.0312, -0.0757, 0.8320, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4062, -4.2500, -1.3828, 2.2969, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.6562, -4.5625, -0.6328, 1.5781, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.2656, 1.2812, 3.8438, -1.4062, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-6.0625, -5.9375, -2.6250, 1.6875, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.2969, 1.0703, 3.5781, -1.0703, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:20:09,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.22 | optimizer_step: 0.19
[2025-11-06 18:20:09,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) |
fwd_microstep: 151.00 | bwd_microstep: 274.25 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 273.13 | step_microstep: 2.12
[2025-11-06 18:20:09,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.45 | bwd: 275.16 | bwd_inner: 1.83 | bwd_allreduce: 273.18 | step: 2.20
41%|████▏ | 1453/3507 [35:23<38:50, 1.13s/it] {'loss': 0.8036, 'learning_rate': 1.3207651327768994e-05, 'epoch': 0.41}
41%|████▏ | 1453/3507 [35:23<38:50, 1.13s/it]
tensor([[-2.2031, 1.2812, 3.8438, -1.3828, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.3438, -5.1250, -0.5938, 1.7969, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:20:09,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.03 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-6.0000, -3.0781, 2.0000, 0.0420, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.8438, -3.2031, 0.7500, 1.3516, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-8.4375, -6.7812, -1.4453, 0.5000, -5.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.8438, -3.1094, 0.3281, 0.5625, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5938, -2.4219, 2.5000, -0.4766, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.3750, -3.3750, 1.1797, 1.4922, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:20:12,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.96 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:20:12,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.83 | bwd_microstep:
2052.02 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 2050.86 | step_microstep: 2.88
[2025-11-06 18:20:12,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.88 | bwd: 2053.05 | bwd_inner: 2.02 | bwd_allreduce: 2050.91 | step: 2.96
41%|████▏ | 1454/3507 [35:26<58:15, 1.70s/it] {'loss': 0.6246, 'learning_rate': 1.3198900807257129e-05, 'epoch': 0.41}
41%|████▏ | 1454/3507 [35:26<58:15, 1.70s/it]
tensor([[-7.1562, -4.0938, 0.2617, -2.5000, -5.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.2656, 2.0156, 3.4219, -1.5625, -1.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:20:12,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.71 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.3594, -3.5312, -0.9492, 3.4844, -1.2578]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1875, -3.0938, 0.2891, -0.9258, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.9375, -2.2188, 1.1172, -1.2578, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-0.5195, 2.5312, 3.7031, -0.7305, -1.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.6875, -4.5938, -0.5430, 1.6094, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.9531, -3.5625, -0.6484, 2.7188, -1.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:20:12,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:20:12,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.66 | bwd_microstep: 163.13 | bwd_inner_microstep:
1.12 | bwd_allreduce_microstep: 161.92 | step_microstep: 1.65
[2025-11-06 18:20:12,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.40 | bwd: 163.99 | bwd_inner: 1.89 | bwd_allreduce: 161.96 | step: 1.73
41%|████▏ | 1455/3507 [35:26<46:45, 1.37s/it] {'loss': 0.7711, 'learning_rate': 1.319014755721934e-05, 'epoch': 0.41}
41%|████▏ | 1455/3507 [35:26<46:45, 1.37s/it]
tensor([[-4.7812, -4.0312, -0.1846, 3.1719, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2500, -1.6719, 2.1719, 0.7539, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:20:13,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.64 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.7188, -3.4219, 0.5195, -0.6289, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.5625, -4.2500, -0.2891, 1.4688, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9062, -2.1094, 1.2969, 0.9844, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.7891, 0.1328, 2.9531, 2.0781, -1.1797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.6250, -4.8438, -0.8516, -0.5781, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.4062, -3.7656, 0.5000, 1.7656, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:20:14,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.27
[2025-11-06 18:20:14,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.13 | bwd_microstep: 971.56 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 970.54 |
step_microstep: 2.02 [2025-11-06 18:20:14,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.80 | bwd: 972.39 | bwd_inner: 1.68 | bwd_allreduce: 970.58 | step: 2.10 42%|████▏ | 1456/3507 [35:28<46:39, 1.36s/it] {'loss': 0.4695, 'learning_rate': 1.3181391585124503e-05, 'epoch': 0.42} 42%|████▏ | 1456/3507 [35:28<46:39, 1.36s/it]tensor([[-4.9375, -4.3750, -1.1016, 2.0469, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[0.4668, 1.9297, 4.1875, 4.8125, 1.0078]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8125, -1.3281, 2.4531, 0.7695, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:20:14,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.10 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.0312, -3.1875, -1.3828, 2.4062, -1.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -3.1250, 0.3066, 0.4688, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9531, -1.2344, 2.5000, -0.1118, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.0938, -4.5938, -0.3359, 0.8828, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2812, -2.4375, 1.1328, 1.4375, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:20:17,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.20 | optimizer_step: 0.24 [2025-11-06 18:20:17,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.26 | bwd_microstep: 2873.34 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 2872.28 | step_microstep: 2.43 [2025-11-06 
18:20:17,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.39 | bwd: 2874.45 | bwd_inner: 1.96 | bwd_allreduce: 2872.33 | step: 2.52 42%|████▏ | 1457/3507 [35:31<1:06:08, 1.94s/it] {'loss': 0.2968, 'learning_rate': 1.3172632898443833e-05, 'epoch': 0.42} 42%|████▏ | 1457/3507 [35:31<1:06:08, 1.94s/it]tensor([[-4.7812, -0.8438, 3.9375, -0.8242, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9375, -2.2344, 1.8516, 0.0791, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8906, -3.2188, -1.9844, 1.3828, -1.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:20:17,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.78 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.1250, -4.2500, -1.5078, 3.0312, -1.8203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5625, -1.6641, 2.2188, -0.2451, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8125, -3.8594, -1.6250, 1.7969, -1.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.6406, 0.7188, 2.7969, -2.3594, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.2500, -3.5938, 0.2412, 0.8359, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:20:18,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.18 | optimizer_step: 0.16 [2025-11-06 18:20:18,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.62 | bwd_microstep: 243.79 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 242.70 | step_microstep: 1.94 [2025-11-06 18:20:18,169] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.43 | bwd: 244.47 | bwd_inner: 1.58 | bwd_allreduce: 242.74 | step: 2.04 42%|████▏ | 1458/3507 [35:31<52:30, 1.54s/it] {'loss': 0.7136, 'learning_rate': 1.3163871504650851e-05, 'epoch': 0.42} 42%|████▏ | 1458/3507 [35:31<52:30, 1.54s/it]tensor([[-4.8125, -2.6406, 0.7266, -0.1738, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -3.8125, 0.2344, 1.4375, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:20:18,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.55 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.4844, -0.4707, 2.7656, -0.5391, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2500, -4.4688, 0.5430, 1.8359, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.6953, 1.1484, 2.6719, -1.7031, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.4688, -2.2656, 1.6719, 1.3281, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.8750, -5.9062, -1.9062, 0.6406, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6562, -1.0078, 3.5938, -0.6016, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:20:19,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:20:19,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.86 | bwd_microstep: 478.86 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 477.83 | step_microstep: 1.47 [2025-11-06 18:20:19,082] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 398.43 | bwd: 479.61 | bwd_inner: 1.63 | bwd_allreduce: 477.86 | step: 1.54 42%|████▏ | 1459/3507 [35:32<46:04, 1.35s/it] {'loss': 0.8905, 'learning_rate': 1.315510741122139e-05, 'epoch': 0.42} 42%|████▏ | 1459/3507 [35:32<46:04, 1.35s/it]tensor([[-3.6562, -0.7344, 2.3594, -0.7812, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -1.4453, 2.7188, -0.9844, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9062, -3.6719, 0.2871, 2.5469, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5625, -3.8125, -0.6211, 1.9766, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:20:19,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.40 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.3594, 0.7578, 3.0000, -1.0625, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.4375, -4.0000, -0.5859, 3.0312, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[ 0.2344, 3.6094, 5.9375, 1.2188, -0.4180]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1875, -2.9062, 1.0547, 0.3008, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:20:20,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:20:20,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.33 | bwd_microstep: 542.73 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 541.65 | step_microstep: 2.06 [2025-11-06 18:20:20,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 514.76 | bwd: 543.71 | 
bwd_inner: 1.89 | bwd_allreduce: 541.69 | step: 2.14 42%|████▏ | 1460/3507 [35:34<44:00, 1.29s/it] {'loss': 0.8559, 'learning_rate': 1.3146340625633594e-05, 'epoch': 0.42} 42%|████▏ | 1460/3507 [35:34<44:00, 1.29s/it]tensor([[-4.4688, -1.2266, 3.0938, -0.3926, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8906, 0.1211, 3.6094, -2.0781, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:20:20,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.62 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.9375, -3.6562, -1.0703, 2.0000, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0312, -2.5156, 0.9219, 1.6172, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8125, -3.8906, -0.3145, 2.0312, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -1.8281, 2.4688, -0.1133, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5938, -4.7188, -0.6914, 2.3750, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5938, -3.9688, 0.2051, 0.9609, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:20:20,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.31 | optimizer_step: 0.28 [2025-11-06 18:20:20,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.30 | bwd_microstep: 2.77 | bwd_inner_microstep: 1.41 | bwd_allreduce_microstep: 1.22 | step_microstep: 2.52 [2025-11-06 18:20:20,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.93 | bwd: 3.60 | bwd_inner: 2.13 | bwd_allreduce: 1.26 | 
step: 2.60 42%|████▏ | 1461/3507 [35:34<37:08, 1.09s/it] {'loss': 0.2384, 'learning_rate': 1.313757115536789e-05, 'epoch': 0.42} 42%|████▏ | 1461/3507 [35:34<37:08, 1.09s/it]tensor([[-3.5938, -2.0781, 1.0000, 1.4297, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:20:21,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 216.67 | bwd_microstep: 5.55 | bwd_inner_microstep: 5.39 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-3.2812, 0.7695, 3.9531, -2.4375, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0000, -4.2812, -0.8945, 1.8750, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.4688, -3.7812, -2.3594, 1.1562, -1.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0312, -2.0469, 2.2031, -0.0500, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -3.9844, -0.5312, 2.2969, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -2.3438, 1.1016, 1.5938, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.7500, -5.6875, -0.3145, 0.5273, -5.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:20:23,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:20:23,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.33 | bwd_microstep: 2237.65 | bwd_inner_microstep: 1.40 | bwd_allreduce_microstep: 2236.12 | step_microstep: 1.73 [2025-11-06 18:20:23,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.02 | bwd: 2243.20 | bwd_inner: 6.84 | bwd_allreduce: 2236.18 | step: 1.84 42%|████▏ | 1462/3507 
[35:37<52:56, 1.55s/it] {'loss': 0.5076, 'learning_rate': 1.3128799007907004e-05, 'epoch': 0.42} 42%|████▏ | 1462/3507 [35:37<52:56, 1.55s/it]tensor([[-2.7344, 0.3613, 2.1406, -2.2812, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:20:23,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.42 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.9062, -5.7812, -1.3203, 1.1797, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7188, -4.4375, -1.2031, 2.7500, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7188, -3.0625, 0.3066, 3.2188, -1.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.5312, 1.7031, 3.8906, -0.8945, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.8125, -2.3906, 1.8984, 0.8789, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6562, -4.5000, -0.4941, 1.6094, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.7812, -3.1719, 1.8750, -1.7344, -5.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:20:24,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:20:24,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.52 | bwd_microstep: 183.24 | bwd_inner_microstep: 1.32 | bwd_allreduce_microstep: 181.80 | step_microstep: 2.12 [2025-11-06 18:20:24,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.98 | bwd: 184.19 | bwd_inner: 2.18 | bwd_allreduce: 181.84 | step: 2.20 42%|████▏ | 1463/3507 [35:37<42:39, 1.25s/it] {'loss': 
0.6154, 'learning_rate': 1.3120024190735952e-05, 'epoch': 0.42} 42%|████▏ | 1463/3507 [35:37<42:39, 1.25s/it]tensor([[-3.9688, -3.7344, -0.8047, 3.0156, -1.8203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1250, -2.7188, 0.8359, -3.5156, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4062, -3.4844, 0.2969, 0.1865, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3750, -2.6875, 1.2109, 1.8828, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.8125, -5.6562, -0.6523, -0.4258, -5.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5938, -1.9062, 3.0469, -0.9883, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:20:24,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.83 | bwd_microstep: 1.13 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.9688, -3.4688, 0.6445, -0.7383, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0312, -2.1719, 1.4297, -0.9844, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:20:26,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.20 | optimizer_step: 0.28 [2025-11-06 18:20:26,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.31 | bwd_microstep: 1790.85 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1789.76 | step_microstep: 2.07 [2025-11-06 18:20:26,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.17 | bwd: 1791.98 | bwd_inner: 2.05 | bwd_allreduce: 1789.80 | step: 2.14 42%|████▏ | 1464/3507 [35:40<57:38, 1.69s/it] {'loss': 0.6536, 'learning_rate': 
1.3111246711342016e-05, 'epoch': 0.42} 42%|████▏ | 1464/3507 [35:40<57:38, 1.69s/it]tensor([[-3.8906, -0.3926, 3.4219, -0.6875, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3125, -1.8125, 1.6875, 0.0125, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:20:26,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.07 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.8750, -2.9219, 1.6484, -0.2695, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9062, -1.3828, 2.4375, 0.7266, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.9219, -2.0625, 0.9180, 3.2500, -1.3516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.4844, 0.6328, 2.8125, -1.5938, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -1.1094, 2.9531, -0.6523, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -2.6719, 1.2969, -0.3281, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:20:27,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:20:27,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 40.91 | bwd_microstep: 192.09 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 191.17 | step_microstep: 1.88 [2025-11-06 18:20:27,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 219.00 | bwd: 192.92 | bwd_inner: 1.58 | bwd_allreduce: 191.21 | step: 1.96 42%|████▏ | 1465/3507 [35:41<44:49, 1.32s/it] {'loss': 0.3718, 'learning_rate': 1.3102466577214756e-05, 'epoch': 0.42} 
42%|████▏ | 1465/3507 [35:41<44:49, 1.32s/it]tensor([[-3.9219, -0.7500, 2.1094, -1.9219, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8438, -2.3906, 1.2031, -0.0645, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6562, -1.0859, 2.8906, -1.6484, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3438, -3.7344, -0.5664, 2.3750, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8750, -4.1250, -1.8281, 2.4219, -1.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2812, -4.7500, -1.4141, 1.8203, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6875, -4.1562, 0.2285, 1.7578, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:20:28,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.33 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9062, -2.0938, 1.7891, -0.4629, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:20:29,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:20:29,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.97 | bwd_microstep: 2.01 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.23 [2025-11-06 18:20:29,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 519.31 | bwd: 2.93 | bwd_inner: 1.90 | bwd_allreduce: 0.90 | step: 2.31 42%|████▏ | 1466/3507 [35:42<50:54, 1.50s/it] {'loss': 0.3883, 'learning_rate': 1.3093683795845999e-05, 'epoch': 0.42} 42%|████▏ | 1466/3507 [35:42<50:54, 
1.50s/it]tensor([[-1.1797, 2.0312, 2.7031, -2.2344, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3750, -0.0128, 2.9688, -1.2812, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1875, -2.9688, 0.5039, 2.1406, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:20:29,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.95 | bwd_microstep: 1.14 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2188, -3.3750, -0.0156, 2.4062, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1719, -2.5312, -1.3906, 2.1719, -0.4961]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7812, -3.8594, -0.2910, 2.2969, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5312, -2.1562, 1.4062, -0.3320, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4062, -2.7344, 1.8359, 0.7539, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:20:30,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 18:20:30,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.35 | bwd_microstep: 638.56 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 637.43 | step_microstep: 1.98 [2025-11-06 18:20:30,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.31 | bwd: 639.70 | bwd_inner: 2.10 | bwd_allreduce: 637.48 | step: 2.07 42%|████▏ | 1467/3507 [35:43<45:35, 1.34s/it] {'loss': 0.8512, 'learning_rate': 1.3084898374729826e-05, 'epoch': 0.42} 42%|████▏ | 1467/3507 [35:43<45:35, 1.34s/it]tensor([[-4.7500, -3.3906, 
0.3105, 1.8750, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4375, -2.0312, 2.2031, -1.3672, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7656, -0.9648, 2.2344, 2.3906, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2812, -3.6094, -0.1040, -2.2656, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.6719, 1.0469, 4.1250, -1.1016, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0000, -2.6250, 0.9688, 2.1719, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:20:31,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.49 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.3750, -3.6562, 0.2559, 0.9805, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9844, -3.2812, -0.4121, 1.6094, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:20:31,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:20:31,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.01 | bwd_microstep: 587.85 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 586.59 | step_microstep: 2.17 [2025-11-06 18:20:31,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.53 | bwd: 588.59 | bwd_inner: 1.84 | bwd_allreduce: 586.63 | step: 2.25 42%|████▏ | 1468/3507 [35:45<49:31, 1.46s/it] {'loss': 0.4935, 'learning_rate': 1.3076110321362576e-05, 'epoch': 0.42} 42%|████▏ | 1468/3507 [35:45<49:31, 1.46s/it]tensor([[-3.7344, -1.4297, 1.9062, 0.3848, -2.9219]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1875, -3.9375, -0.7578, 3.1250, -1.9609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:20:31,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.80 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5156, -2.2344, 0.5117, 1.3672, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7188, -4.5000, -1.1250, 3.1562, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5781, 0.6797, 3.9531, -2.6719, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.9062, -3.2344, 0.8125, 1.7812, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.2734, 1.2734, 2.7344, 0.0869, -1.2891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.1250, -0.0198, 1.2422, -0.6406, -1.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:20:33,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:20:33,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.77 | bwd_microstep: 1535.38 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 1534.36 | step_microstep: 1.95 [2025-11-06 18:20:33,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 253.59 | bwd: 1536.08 | bwd_inner: 1.53 | bwd_allreduce: 1534.40 | step: 2.03 42%|████▏ | 1469/3507 [35:47<53:11, 1.57s/it] {'loss': 0.402, 'learning_rate': 1.3067319643242829e-05, 'epoch': 0.42} 42%|████▏ | 1469/3507 [35:47<53:11, 1.57s/it]tensor([[-4.0000, -2.3906, 1.3594, 2.0781, -2.5469]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:20:33,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.24 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.6875, -3.0469, 1.0859, 2.2344, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.5547, 1.5703, 2.8594, -1.6172, -1.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.5000, 0.9922, 2.9062, -1.9141, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.8438, -3.0938, 1.8281, 0.6133, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.5625, 0.1465, 2.8281, 0.1206, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -3.8281, -0.7812, 3.0938, -1.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9062, -3.9531, 0.9180, 1.9375, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:20:34,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.11 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:20:34,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.03 | bwd_microstep: 155.97 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 154.99 | step_microstep: 1.44 [2025-11-06 18:20:34,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.31 | bwd: 156.75 | bwd_inner: 1.61 | bwd_allreduce: 155.02 | step: 1.51 42%|████▏ | 1470/3507 [35:47<42:16, 1.25s/it] {'loss': 0.7046, 'learning_rate': 1.3058526347871407e-05, 'epoch': 0.42} 42%|████▏ | 1470/3507 [35:47<42:16, 1.25s/it]tensor([[-6.0312, -3.6562, 0.9805, 0.5469, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-5.6875, -3.5312, 0.7930, 1.0469, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:20:34,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.04 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5625, -0.1533, 2.6406, -1.7031, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.1250, -5.0938, -2.2969, 1.7422, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5625, -3.3906, 0.2256, -0.4355, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.3750, -5.8438, -2.1406, -1.3359, -5.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.7500, -3.8125, 1.1094, -0.7891, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6875, -2.4844, 1.4062, 0.7109, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:20:37,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.19 | optimizer_step: 0.30 [2025-11-06 18:20:37,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.42 | bwd_microstep: 3046.79 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 3045.73 | step_microstep: 2.52 [2025-11-06 18:20:37,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.47 | bwd: 3047.69 | bwd_inner: 1.78 | bwd_allreduce: 3045.78 | step: 2.60 42%|████▏ | 1471/3507 [35:51<1:03:58, 1.89s/it] {'loss': 0.73, 'learning_rate': 1.3049730442751362e-05, 'epoch': 0.42} 42%|████▏ | 1471/3507 [35:51<1:03:58, 1.89s/it]tensor([[-5.0000, -3.8750, -0.3477, 1.9922, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:20:37,713] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.92 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.6875, -1.6719, 2.1406, -0.4199, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.1875, -4.6562, -1.0234, 2.4688, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.8125, -2.7344, 2.1562, 0.0713, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.4531, -2.0781, 0.4102, 3.4531, -0.8164]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.8125, -3.5781, 0.0796, 1.9062, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.7812, -3.3750, -0.3945, 2.9219, -1.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.0625, -3.2500, -0.1465, 2.1719, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:20:38,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.20 | optimizer_step: 0.20
[2025-11-06 18:20:38,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.12 | bwd_microstep: 172.45 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 171.32 | step_microstep: 2.23
[2025-11-06 18:20:38,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.08 | bwd: 173.44 | bwd_inner: 1.93 | bwd_allreduce: 171.37 | step: 2.32
42%|████▏ | 1472/3507 [35:51<50:40, 1.49s/it] {'loss': 0.4127, 'learning_rate': 1.304093193538798e-05, 'epoch': 0.42}
tensor([[-2.9844, 0.6328, 3.1719, -2.0156, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
[2025-11-06 18:20:38,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.71 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-4.5938, -3.6250, -0.2812, 1.8516, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.5781, 0.8633, 3.1094, -1.6641, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.3438, -3.6875, -1.6406, 2.5938, -1.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.0625, -3.2031, -0.8750, 3.2656, -1.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.2500, -1.0234, 2.2344, -1.5156, -3.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-7.1562, -5.9062, -1.6953, 0.4297, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.1250, -3.8750, -0.0991, 1.9766, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:20:40,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.32 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 18:20:40,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.72 | bwd_microstep: 1822.46 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 1821.48 | step_microstep: 3.73
[2025-11-06 18:20:40,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.44 | bwd: 1823.35 | bwd_inner: 1.66 | bwd_allreduce: 1821.54 | step: 3.82
42%|████▏ | 1473/3507 [35:54<57:47, 1.70s/it] {'loss': 0.6732, 'learning_rate': 1.303213083328876e-05, 'epoch': 0.42}
tensor([[-3.7969, -3.2969, -0.7812, 1.8359, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:20:40,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 114.18 | bwd_microstep: 1.67 | bwd_inner_microstep: 1.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.2188, -3.3125, -0.2695, 1.6641, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.8281, -0.2910, 1.9688, -0.1621, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.9062, -3.9375, -1.4531, 2.9375, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-1.3594, -1.4609, -0.0576, 3.2656, 0.1108]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.0625, -4.3125, -0.5977, 2.4531, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.0000, -3.0781, -0.0064, 2.1250, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.1562, -1.5391, 2.7344, -1.3281, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:20:40,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:20:40,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.14 | bwd_microstep: 1.47 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.60 | step_microstep: 3.24
[2025-11-06 18:20:40,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 370.34 | bwd: 3.14 | bwd_inner: 2.40 | bwd_allreduce: 0.62 | step: 3.31
42%|████▏ | 1474/3507 [35:54<44:37, 1.32s/it] {'loss': 0.1006, 'learning_rate': 1.3023327143963415e-05, 'epoch': 0.42}
tensor([[-2.5469, 0.4395, 2.5469, -1.3828, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-0.9141, 1.8984, 2.6562, -1.3672, -1.4141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-5.1250, -4.4375, -1.2031, 1.6719, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:20:40,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.14 | bwd_microstep: 0.62 | bwd_inner_microstep: 0.51 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.7188, -4.0625, 0.1367, 1.2812, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.3750, -0.2139, 3.8906, -1.6484, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.8906, -1.9844, 1.6250, 1.7578, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.3750, -2.0312, 2.7500, 0.0281, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.9688, -3.9219, 0.9375, 1.6484, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:20:43,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.53 | optimizer_gradients: 0.21 | optimizer_step: 0.20
[2025-11-06 18:20:43,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 93.06 | bwd_microstep: 2672.29 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 2671.49 | step_microstep: 4.02
[2025-11-06 18:20:43,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.22 | bwd: 2672.91 | bwd_inner: 1.23 | bwd_allreduce: 2671.53 | step: 4.10
42%|████▏ | 1475/3507 [35:57<1:01:44, 1.82s/it] {'loss': 0.3701, 'learning_rate': 1.3014520874923877e-05, 'epoch': 0.42}
tensor([[-4.7188, -4.2188, -0.9727, 2.2344, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:20:43,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 140.05 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-5.2188, -3.8906, 0.1211, 2.0312, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-1.9453, 1.6250, 3.6094, -1.5703, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.9062, -3.8750, -0.3652, 1.7344, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-1.5781, 1.4922, 2.6875, -1.6562, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-5.6250, -3.9219, 0.0850, 0.8945, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.4062, -3.2188, 0.0879, 1.8281, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.5938, -4.4062, 0.5664, 1.0391, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:20:44,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:20:44,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 290.21 | bwd_microstep: 1.80 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.79 | step_microstep: 1.91
[2025-11-06 18:20:44,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 430.27 | bwd: 2.86 | bwd_inner: 1.83 | bwd_allreduce: 0.85 | step: 2.03
42%|████▏ | 1476/3507 [35:58<48:08, 1.42s/it] {'loss': 0.4334, 'learning_rate': 1.3005712033684263e-05, 'epoch': 0.42}
tensor([[-5.2500, -2.8906, 0.6836, -0.4355, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.6250, -4.5000, -0.1738, 2.7656, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.2500, -5.0938, -0.7734, 1.9766, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.6875, -3.8125, 0.7773, 1.5781, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:20:44,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.56 | bwd_microstep: 1.12 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-3.5469, -3.6719, -1.2188, 2.8281, -1.4453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.9844, -2.5781, 0.9023, 2.3594, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.0312, -1.0938, 2.3906, -0.0747, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.8281, -0.3848, 3.4844, -0.3184, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:20:46,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.25 | optimizer_gradients: 0.21 | optimizer_step: 0.31
[2025-11-06 18:20:46,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.08 | bwd_microstep: 2148.90 | bwd_inner_microstep: 1.57 | bwd_allreduce_microstep: 2147.21 | step_microstep: 4.08
[2025-11-06 18:20:46,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.66 | bwd: 2150.02 | bwd_inner: 2.57 | bwd_allreduce: 2147.28 | step: 4.18
42%|████▏ | 1477/3507 [36:00<59:07, 1.75s/it] {'loss': 0.3931, 'learning_rate': 1.2996900627760897e-05, 'epoch': 0.42}
tensor([[-5.0000, -1.9453, 1.5703, -1.3594, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.3125, -4.9688, -1.5859, 2.2969, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.5000, -2.0469, 1.6250, 0.5547, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:20:46,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.95 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-2.3594, -0.2969, 1.7812, 0.1914, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.5000, -4.4375, -1.1719, 3.3125, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.5625, -3.6250, -1.5469, 2.0469, -1.6328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:0')
tensor([[-4.5938, -4.1250, -1.3047, 1.4844, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.3438, -3.2656, -0.5508, 3.4062, -1.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:20:47,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.07 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:20:47,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.95 | bwd_microstep: 31.35 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 30.37 | step_microstep: 3.45
[2025-11-06 18:20:47,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.93 | bwd: 32.41 | bwd_inner: 1.81 | bwd_allreduce: 30.43 | step: 3.55
42%|████▏ | 1478/3507 [36:00<45:41, 1.35s/it] {'loss': 0.7097, 'learning_rate': 1.2988086664672285e-05, 'epoch': 0.42}
tensor([[-3.3125, -3.4531, -1.7031, 1.7969, -1.4453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.0312, -4.6562, -1.6172, 1.7656, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.5938, -1.9688, 0.7305, -1.3750, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-3.7812, -0.2344, 3.8906, -0.0825, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:20:47,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 281.11 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-4.5312, -2.9531, 0.8516, 2.1094, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.5938, -4.2188, -0.6289, 3.3281, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-9.8750, -9.3750, -4.9375, -0.9805, -6.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.1250, -2.7812, 0.4707, 1.6172, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:20:50,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:20:50,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.66 | bwd_microstep: 3114.51 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 3113.51 | step_microstep: 2.39
[2025-11-06 18:20:50,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 478.81 | bwd: 3115.56 | bwd_inner: 1.82 | bwd_allreduce: 3113.58 | step: 2.50
42%|████▏ | 1479/3507 [36:04<1:08:54, 2.04s/it] {'loss': 0.7571, 'learning_rate': 1.2979270151939116e-05, 'epoch': 0.42}
tensor([[-3.2188, 0.6680, 3.7812, -1.2812, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:20:50,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.86 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.8906, -2.6562, 0.4570, 1.8516, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.5000, -4.7188, -1.3672, 1.3906, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.3750, -1.9375, 2.3594, -0.7344, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.0938, -4.2812, -1.7109, 2.7188, -1.8359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.5000, -0.8242, 3.2969, -0.8789, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.4219, -0.1514, 2.6562, -1.4219, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-4.1875, -3.4375, 0.0752, 3.1406, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:20:51,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:20:51,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.61 | bwd_microstep: 77.51 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 76.57 | step_microstep: 2.16
[2025-11-06 18:20:51,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 457.49 | bwd: 78.44 | bwd_inner: 1.70 | bwd_allreduce: 76.61 | step: 2.25
42%|████▏ | 1480/3507 [36:05<54:02, 1.60s/it] {'loss': 0.4282, 'learning_rate': 1.2970451097084258e-05, 'epoch': 0.42}
tensor([[-5.3750, -3.7188, 0.1670, 1.1484, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.4062, -1.9453, 1.3281, 2.2812, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:20:51,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.83 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.2188, -3.3438, 0.0076, 2.3125, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.5938, 1.1094, 3.3750, -2.0469, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.8438, -5.0000, -0.9375, 2.2812, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.5781, -3.0781, -1.4922, 2.7656, -0.6523]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-6.1562, -6.0000, -3.1719, 0.4727, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.6250, -1.5703, 2.6250, 0.2500, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:20:57,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.20 | optimizer_step: 0.29
[2025-11-06 18:20:57,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.47 | bwd_microstep: 5347.89 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 5346.70 | step_microstep: 2.34
[2025-11-06 18:20:57,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 304.33 | bwd: 5348.74 | bwd_inner: 1.85 | bwd_allreduce: 5346.75 | step: 2.42
42%|████▏ | 1481/3507 [36:10<1:35:24, 2.83s/it] {'loss': 0.1421, 'learning_rate': 1.2961629507632743e-05, 'epoch': 0.42}
tensor([[-2.2656, 1.1094, 2.8750, -2.1719, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.0625, -3.6250, 0.1885, 1.7422, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.4688, -2.9688, 0.5352, 1.6484, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:20:57,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.23 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.06
tensor([[-3.6562, -3.1719, -0.0483, 3.4219, -1.6172]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.1250, -3.4531, -0.0894, 2.9688, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.9531, -3.0000, 0.0256, 2.3594, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.7188, -0.6914, 3.1250, -2.0625, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.5781, -0.6836, 2.3750, -0.1924, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:20:57,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.13 | optimizer_step: 0.13
[2025-11-06 18:20:57,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.82 | bwd_microstep: 85.33 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 84.48 | step_microstep: 1.75
[2025-11-06 18:20:57,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 477.07 | bwd: 86.17 | bwd_inner: 1.51 | bwd_allreduce: 84.52 | step: 1.81
42%|████▏ | 1482/3507 [36:11<1:12:51, 2.16s/it] {'loss': 0.1284, 'learning_rate': 1.2952805391111767e-05, 'epoch': 0.42}
tensor([[-4.2812, -1.5078, 2.7188, 1.4141, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.7344, -3.2812, -0.2891, 3.0312, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-1.5781, -1.2109, 1.1641, 4.2500, -0.1113]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:20:57,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.04 | bwd_microstep: 0.61 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.04
tensor([[-5.0000, -3.2188, 0.4395, 0.8242, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.2656, 0.4746, 3.4844, -1.6094, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.9375, -3.6719, -0.0461, -1.1719, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-4.2188, -1.4922, 1.9297, -0.0608, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.1250, -1.8750, 2.3594, -0.6758, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:21:02,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 18:21:02,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.23 | bwd_microstep: 4344.65 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 4343.84 | step_microstep: 3.27
[2025-11-06 18:21:02,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.28 | bwd: 4345.26 | bwd_inner: 1.28 | bwd_allreduce: 4343.88 | step: 3.31
42%|████▏ | 1483/3507 [36:16<1:39:01, 2.94s/it] {'loss': 0.6884, 'learning_rate': 1.2943978755050688e-05, 'epoch': 0.42}
tensor([[-5.6562, -2.9062, 1.2188, -0.1172, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.3438, -3.2656, 1.3203, 1.8125, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.3906, 1.2734, 3.1094, -2.4219, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.3438, -0.9219, 2.0156, 2.9531, -1.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:21:02,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.13 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.9375, -3.4375, -0.0238, -1.6953, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.6562, -1.9453, 1.6016, -0.2891, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.5938, -5.7812, -1.8750, 1.1719, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.9844, -2.2812, 0.8125, 1.4453, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:21:02,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:21:02,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.81 | bwd_microstep: 1.93 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.82 | step_microstep: 1.49
[2025-11-06 18:21:02,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.96 | bwd: 2.80 | bwd_inner: 1.84 | bwd_allreduce: 0.85 | step: 1.55
42%|████▏ | 1484/3507 [36:16<1:13:35, 2.18s/it] {'loss': 0.4077, 'learning_rate': 1.2935149606981008e-05, 'epoch': 0.42}
tensor([[-3.2969, 0.3535, 3.5469, -0.7930, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.0312, -5.0938, -0.7383, 2.6562, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.2500, -3.5938, -0.5312, 2.3281, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.0312, -1.3438, 1.5000, -0.7500, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:21:03,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.49 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.5000, -1.2188, 1.5625, 2.8906, -1.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.5000, -4.6250, -1.9219, 2.3281, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-7.5312, -6.6875, -2.7188, 0.3887, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.3750, -3.2188, 0.4199, 2.5469, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:21:03,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:21:03,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.53 | bwd_microstep: 717.81 | bwd_inner_microstep: 1.48 | bwd_allreduce_microstep: 716.23 | step_microstep: 1.83
[2025-11-06 18:21:03,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.05 | bwd: 718.57 | bwd_inner: 2.17 | bwd_allreduce: 716.27 | step: 1.90
42%|████▏ | 1485/3507 [36:17<1:03:10, 1.87s/it] {'loss': 0.2639, 'learning_rate': 1.292631795443637e-05, 'epoch': 0.42}
tensor([[-3.2969, -3.2969, -1.3906, 2.0000, -1.4453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.2344, -3.0625, -0.4766, 3.2031, -1.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:21:04,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.05 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.5312, -1.5469, 1.4844, -1.4062, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.1875, -4.5938, -0.7109, 0.4609, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.7500, -1.3750, 2.8906, -0.0928, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.0469, -1.1328, 1.9219, 1.4141, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.5312, -2.2969, 1.2188, -2.0938, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-1.3438, 1.7344, 2.8906, -1.1172, -1.7266]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 18:21:04,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.26 | optimizer_step: 0.23
[2025-11-06 18:21:04,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.87 | bwd_microstep: 72.10 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 71.10 | step_microstep: 2.32
[2025-11-06 18:21:04,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 288.94 | bwd: 73.09 | bwd_inner: 1.78 | bwd_allreduce: 71.15 | step: 2.42
42%|████▏ | 1486/3507 [36:18<48:12, 1.43s/it] {'loss': 0.1815, 'learning_rate': 1.2917483804952562e-05, 'epoch': 0.42}
tensor([[-4.2500, -1.9141, 1.5859, 0.7812, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.1250, 0.6875, 3.4219, -1.5078, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.9219, -3.9531, -1.0078, 3.4688, -1.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.5156, -2.6875, -1.1953, 2.2969, -0.7695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:21:04,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.03 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.1250, -1.5078, 3.1719, -0.2852, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.2031, -0.4453, 1.8438, -0.9453, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.0000, -2.6094, 0.9023, 2.5469, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.6250, -3.4688, -1.0156, 2.4062, -1.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:21:06,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.20 | optimizer_step: 0.29
[2025-11-06 18:21:06,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.76 | bwd_microstep: 2171.75 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 2170.63 | step_microstep: 2.14
[2025-11-06 18:21:06,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.81 | bwd: 2172.51 | bwd_inner: 1.70 | bwd_allreduce: 2170.68 | step: 2.21
42%|████▏ | 1487/3507 [36:20<59:44, 1.77s/it] {'loss': 0.1279, 'learning_rate': 1.2908647166067496e-05, 'epoch': 0.42}
tensor([[-6.0000, -2.6094, 1.7734, -1.2422, -5.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-1.1406, 1.5781, 2.3750, -1.3594, -1.5391]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.3125, -3.0625, 1.3984, 1.2266, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>)
[2025-11-06 18:21:07,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.89 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([3], device='cuda:1')
tensor([[-5.4062, -4.6250, -0.8398, 2.3750, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.7812, -0.3457, 2.7812, 1.2500, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-1.7891, 1.8516, 4.1562, -0.8164, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.4688, -1.7578, 1.9922, 0.4453, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-2.6250, -2.6094, -0.5586, 2.9844, -0.8398]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:21:07,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.18
[2025-11-06 18:21:07,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.57 | bwd_microstep: 9.77 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 8.59 | step_microstep: 1.59
[2025-11-06 18:21:07,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.50 | bwd: 10.87 | bwd_inner: 2.11 | bwd_allreduce: 8.64 | step: 1.67
42%|████▏ | 1488/3507 [36:21<45:57, 1.37s/it] {'loss': 0.4341, 'learning_rate': 1.2899808045321208e-05, 'epoch': 0.42}
tensor([[-5.5938, -2.7969, 1.6953, 0.2031, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.9375, -3.6406, 1.0859, 1.2812, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.0938, -2.7656, 0.8516, 2.7344, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:21:07,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.49 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.2188, -1.8750, 2.7500, -0.1797, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.1875, -2.9062, 0.5820, 2.1719, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.3750, -1.6172, 0.9531, 0.2871, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.0938, -2.2344, 1.5234, -0.5469, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.6875, -3.7344, 0.2754, 0.5469, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:21:08,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:21:08,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.49 | bwd_microstep: 276.87 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 275.66 | step_microstep: 2.26
[2025-11-06 18:21:08,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 455.01 | bwd: 277.74 | bwd_inner: 1.88 | bwd_allreduce: 275.70 | step: 2.34
42%|████▏ | 1489/3507 [36:22<41:43, 1.24s/it] {'loss': 0.4138, 'learning_rate': 1.2890966450255862e-05, 'epoch': 0.42}
tensor([[-3.2188, -3.6250, -2.3906, 1.1328, -1.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.2500, -2.9219, 1.6641, 1.6250, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.0312, -1.8828, 1.8203, -1.1562, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.5000, -3.5000, -0.2598, 1.9766, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.0312, -2.7344, -0.4824, 2.5625, -1.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:21:09,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.98 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.5312, -3.6875, -0.6523, 1.7109, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.2500, -3.9062, -0.2773, 1.4219, -3.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.4375, -3.1875, 1.0156, 0.7500, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 18:21:10,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.14 | optimizer_step: 0.19
[2025-11-06 18:21:10,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.10 | bwd_microstep: 171.60 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 170.44 | step_microstep: 2.04
[2025-11-06 18:21:10,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.10 | bwd: 172.53 | bwd_inner: 1.91 | bwd_allreduce: 170.48 | step: 2.12
42%|████▏ | 1490/3507 [36:24<49:04, 1.46s/it] {'loss': 0.2477, 'learning_rate': 1.2882122388415716e-05, 'epoch': 0.42}
tensor([[-5.2500, -3.7188, -0.0099, 1.3750, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:21:10,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.30 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.5000, -3.6250, -0.3809, 2.1562, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.5312, -2.9531, 2.2188, -0.6484, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.4688, -4.8125, -1.1094, 2.3594, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.9219, -2.9844, 0.4219, 3.0469, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.2188, -3.5938, 1.3359, 0.5078, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[1.9688, 3.6875, 5.5312, 5.4375, 2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.7500, -2.7656, 2.0781, 0.1641, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:21:11,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.23 | optimizer_step: 0.21
[2025-11-06 18:21:11,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.66 | bwd_microstep: 689.98 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 688.86 | step_microstep: 2.02
[2025-11-06 18:21:11,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.00 | bwd: 690.98 | bwd_inner: 1.92 | bwd_allreduce: 688.91 | step: 2.12
43%|████▎ | 1491/3507 [36:25<45:03, 1.34s/it] {'loss': 0.7512, 'learning_rate': 1.287327586734715e-05, 'epoch': 0.43}
tensor([[-6.6250, -4.0625, 0.2676, -0.5742, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.4062, -0.0244, 2.0781, -0.2480, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-0.7773, 1.7734, 2.3438, -0.6836, -1.0859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-4.9375, -2.6094, 1.1406, 0.5312, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.0625, -3.3750, 1.4531, 0.5664, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.5781, -1.7969, 1.2344, 1.1484, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=
tensor([2], device='cuda:2') tensor([[-4.3438, -1.9219, 1.0156, -0.5234, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:21:13,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.56 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.1250, -0.8516, 1.5156, -0.1699, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:13,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.88 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:21:13,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.18 | bwd_microstep: 1.74 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 3.08 [2025-11-06 18:21:13,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.76 | bwd: 2.62 | bwd_inner: 1.70 | bwd_allreduce: 0.79 | step: 3.20 43%|████▎ | 1492/3507 [36:27<57:34, 1.71s/it] {'loss': 0.6733, 'learning_rate': 1.2864426894598629e-05, 'epoch': 0.43} 43%|████▎ | 1492/3507 [36:27<57:34, 1.71s/it]tensor([[-3.7656, -1.0547, 1.8672, -0.1963, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.3750, -4.4062, -0.2490, 0.1074, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.5703, 2.2969, 4.2188, -1.4609, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1875, -0.0471, 3.2969, -2.4844, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:21:14,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.07 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-1.9766, 1.7969, 3.2969, -2.4844, -2.6250]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-2.9375, -3.6250, -2.0938, 2.7344, -0.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5000, -4.0312, 0.9688, 0.9453, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0000, -2.1094, 2.1719, 0.6484, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:14,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:21:14,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.33 | bwd_microstep: 1.80 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.81 | step_microstep: 1.84 [2025-11-06 18:21:14,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 564.43 | bwd: 2.85 | bwd_inner: 1.83 | bwd_allreduce: 0.86 | step: 1.93 43%|████▎ | 1493/3507 [36:28<46:30, 1.39s/it] {'loss': 0.6371, 'learning_rate': 1.285557547772072e-05, 'epoch': 0.43} 43%|████▎ | 1493/3507 [36:28<46:30, 1.39s/it]tensor([[-4.3750, -1.3828, 2.0312, -0.6445, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8906, -0.9883, 2.2188, 2.5312, -1.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0312, -0.8906, 2.9062, -2.5156, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9844, -3.0938, 0.0439, 2.2344, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5000, -3.8750, 0.6211, -0.1387, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4844, -1.9531, 0.3887, 2.8281, -1.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6875, -3.1719, 0.4961, -0.7812, -4.4375]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:17,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.64 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.1875, -3.4219, -0.1060, 3.0469, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:17,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:21:17,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.75 | bwd_microstep: 2.06 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.13 [2025-11-06 18:21:17,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.42 | bwd: 2.90 | bwd_inner: 1.88 | bwd_allreduce: 0.88 | step: 2.23 43%|████▎ | 1494/3507 [36:31<1:00:03, 1.79s/it] {'loss': 0.8355, 'learning_rate': 1.2846721624266068e-05, 'epoch': 0.43} 43%|████▎ | 1494/3507 [36:31<1:00:03, 1.79s/it]tensor([[-5.9062, -4.8125, -0.7539, 1.7812, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0938, -3.6250, -0.5195, 2.6562, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2188, -1.1484, 1.9844, 4.0625, -0.8242]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:17,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.86 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.0469, -0.2617, 2.0156, -1.3516, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.8438, -2.8438, 0.5938, 0.4961, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, 
-2.6719, 1.5234, 0.4258, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5312, -0.6914, 2.8438, -1.9453, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4688, -1.9219, 0.6445, 0.6133, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:21:18,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:21:18,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 266.11 | bwd_microstep: 583.60 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 582.34 | step_microstep: 1.75 [2025-11-06 18:21:18,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 436.00 | bwd: 584.57 | bwd_inner: 2.05 | bwd_allreduce: 582.38 | step: 1.82 43%|████▎ | 1495/3507 [36:32<52:40, 1.57s/it] {'loss': 0.5717, 'learning_rate': 1.2837865341789399e-05, 'epoch': 0.43} 43%|████▎ | 1495/3507 [36:32<52:40, 1.57s/it]tensor([[-3.5625, -1.9453, 1.1562, 1.7188, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3750, -3.0000, 1.2031, 0.5625, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5625, -1.0078, 2.7500, -1.0234, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5000, -3.7188, 0.7031, 2.0625, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9219, -3.4219, -0.1387, 3.4375, -1.8516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:18,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.37 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.5000, -3.4688, 0.3594, 0.4746, -4.0000]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -3.7344, -0.5117, 2.7812, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.3438, -5.0312, -0.1924, -0.2119, -5.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:21:18,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:21:18,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.82 | bwd_microstep: 1.59 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.62 | step_microstep: 1.60 [2025-11-06 18:21:18,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.19 | bwd: 2.39 | bwd_inner: 1.62 | bwd_allreduce: 0.65 | step: 1.67 43%|████▎ | 1496/3507 [36:32<42:08, 1.26s/it] {'loss': 0.4225, 'learning_rate': 1.2829006637847514e-05, 'epoch': 0.43} 43%|████▎ | 1496/3507 [36:32<42:08, 1.26s/it]tensor([[-3.0625, -3.5625, -2.0938, 2.1094, -1.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2188, -3.0156, 1.3359, 1.4531, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:19,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.73 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12 tensor([[-5.2188, -3.2656, 0.6562, 0.8828, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8281, -2.8281, -0.0928, 1.7656, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8906, -1.3203, 2.1094, 0.3477, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.6406, 0.4160, 2.4062, -1.3125, -2.6562]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:0') tensor([[-2.7969, 0.6133, 2.5156, -1.3828, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8438, -5.3750, -2.2344, 1.1641, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:21:21,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.26 [2025-11-06 18:21:21,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.85 | bwd_microstep: 2172.41 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 2171.41 | step_microstep: 1.96 [2025-11-06 18:21:21,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.62 | bwd: 2173.43 | bwd_inner: 1.76 | bwd_allreduce: 2171.49 | step: 2.09 43%|████▎ | 1497/3507 [36:35<55:29, 1.66s/it] {'loss': 0.4878, 'learning_rate': 1.2820145519999285e-05, 'epoch': 0.43} 43%|████▎ | 1497/3507 [36:35<55:29, 1.66s/it]tensor([[-0.8828, 2.6875, 3.4219, -1.8047, -1.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.0000, -4.0312, -0.3125, 2.5938, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2812, -3.2031, -0.4102, -1.0938, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8438, 0.0791, 2.9219, -2.2344, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9062, -1.3047, 1.9766, 0.1738, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1875, -3.2812, -0.8711, 3.2031, -1.1953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9062, 1.0469, 2.6250, -1.3203, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:21:21,928] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 67.90 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.1562, -2.4531, 0.6445, 1.1250, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:22,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:21:22,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.96 | bwd_microstep: 2.06 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.98 | step_microstep: 1.81 [2025-11-06 18:21:22,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 252.88 | bwd: 2.78 | bwd_inner: 1.61 | bwd_allreduce: 1.02 | step: 1.90 43%|████▎ | 1498/3507 [36:35<45:43, 1.37s/it] {'loss': 0.4217, 'learning_rate': 1.2811281995805626e-05, 'epoch': 0.43} 43%|████▎ | 1498/3507 [36:35<45:43, 1.37s/it]tensor([[-5.1562, -4.2188, -1.1641, 0.9453, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:21:22,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.58 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.2188, -4.3438, -1.5000, 0.4082, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7500, 0.5469, 2.4375, -1.9922, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.9688, -1.3750, 0.8672, 0.3691, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0625, -3.6094, 0.7344, 2.7188, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4375, -4.1250, -0.8281, 3.0625, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.6875, -4.6250, 0.2344, 1.0000, 
-4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.3125, -3.3594, 0.7734, 1.5469, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:21:22,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:21:22,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.46 | bwd_microstep: 175.05 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 174.08 | step_microstep: 2.61 [2025-11-06 18:21:22,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.05 | bwd: 175.89 | bwd_inner: 1.63 | bwd_allreduce: 174.12 | step: 2.69 43%|████▎ | 1499/3507 [36:36<37:43, 1.13s/it] {'loss': 0.9069, 'learning_rate': 1.2802416072829524e-05, 'epoch': 0.43} 43%|████▎ | 1499/3507 [36:36<37:43, 1.13s/it]tensor([[-4.2812, -3.5156, -0.7070, 1.6875, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5000, -2.8594, 0.8711, -0.5664, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9375, -2.1562, 1.2578, 1.7500, -2.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4531, -0.2754, 1.6641, -2.1250, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.1875, -5.5312, -1.4922, -0.2295, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8125, -1.2891, 2.2500, 1.0469, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1250, -3.2812, 0.9688, 2.3594, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:21:25,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.62 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.81 | 
bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-6.3750, -5.1875, -1.2734, 1.0625, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:21:25,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:21:25,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.53 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.90 | step_microstep: 2.22 [2025-11-06 18:21:25,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.15 | bwd: 2.77 | bwd_inner: 1.65 | bwd_allreduce: 0.95 | step: 2.32 43%|████▎ | 1500/3507 [36:39<52:28, 1.57s/it] {'loss': 1.0626, 'learning_rate': 1.2793547758636002e-05, 'epoch': 0.43} 43%|████▎ | 1500/3507 [36:39<52:28, 1.57s/it]tensor([[-7.1562, -6.9375, -3.6094, 0.5898, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.2031, -0.5898, 2.3438, 2.7344, -1.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4062, -4.3438, -0.5391, 2.2031, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:25,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.69 | bwd_microstep: 1.47 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-3.3594, -2.2500, 0.5352, 1.9922, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -3.0938, 0.9492, 0.5742, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7500, -4.0938, -0.8125, 2.2344, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -3.9375, -0.6289, 2.5156, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:2') tensor([[-2.3594, -2.0469, -0.0581, 2.7031, -0.8984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:25,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:21:25,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.46 | bwd_microstep: 1.57 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.61 | step_microstep: 2.82 [2025-11-06 18:21:25,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.18 | bwd: 3.03 | bwd_inner: 2.24 | bwd_allreduce: 0.65 | step: 2.92 43%|████▎ | 1501/3507 [36:39<41:09, 1.23s/it] {'loss': 0.204, 'learning_rate': 1.278467706079213e-05, 'epoch': 0.43} 43%|████▎ | 1501/3507 [36:39<41:09, 1.23s/it]tensor([[-3.4062, -3.7500, -1.5312, 3.0938, -1.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3750, -3.7656, -0.3633, 2.9375, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3125, -0.3535, 2.4219, -0.5078, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.6016, 1.9688, 3.3125, -2.1406, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-1.3203, 1.4453, 3.7031, 1.2500, -1.2109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0000, -4.0938, 0.3750, 1.3359, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1250, -3.6094, -0.5586, 2.6875, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:27,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.04 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-5.5625, -3.4844, 
1.0391, 1.5547, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:27,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:21:27,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.40 | bwd_microstep: 1.86 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.76 [2025-11-06 18:21:27,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 419.46 | bwd: 2.74 | bwd_inner: 1.70 | bwd_allreduce: 0.88 | step: 1.86 43%|████▎ | 1502/3507 [36:41<45:58, 1.38s/it] {'loss': 0.4712, 'learning_rate': 1.2775803986867001e-05, 'epoch': 0.43} 43%|████▎ | 1502/3507 [36:41<45:58, 1.38s/it]tensor([[-4.5625, -3.1719, 0.6797, 2.3438, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5938, -2.8906, 0.4531, 1.1719, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2031, -0.3555, 2.2656, -0.8984, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4375, -2.3594, 1.0078, 0.5234, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1406, 0.5391, 2.7188, -2.6250, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:21:28,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 299.66 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.6484, 1.5000, 2.2188, -2.3750, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.0312, -3.9688, -1.0391, 3.2188, -1.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9844, -3.5312, -0.6367, 2.5469, -2.0469]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:21:29,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:21:29,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 234.98 | bwd_microstep: 306.69 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 305.73 | step_microstep: 1.73 [2025-11-06 18:21:29,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 534.67 | bwd: 307.37 | bwd_inner: 1.45 | bwd_allreduce: 305.78 | step: 1.81 43%|████▎ | 1503/3507 [36:42<48:19, 1.45s/it] {'loss': 0.4116, 'learning_rate': 1.2766928544431748e-05, 'epoch': 0.43} 43%|████▎ | 1503/3507 [36:42<48:19, 1.45s/it]tensor([[-4.8125, -5.0938, -2.4844, 2.0781, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3906, -0.9883, 1.7734, 0.0117, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5938, -4.3438, -0.3652, 1.9609, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4062, 0.0811, 3.0156, -0.8633, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0000, -3.6094, 0.2305, 2.0312, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5625, -4.8438, -0.1660, 1.4609, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1250, -3.9844, -0.1260, 2.2188, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:21:30,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.03 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.7500, -4.1562, -1.0938, 1.9688, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:0') [2025-11-06 18:21:30,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:21:30,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.20 | bwd_microstep: 2.11 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.00 [2025-11-06 18:21:30,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.24 | bwd: 3.01 | bwd_inner: 2.02 | bwd_allreduce: 0.86 | step: 2.09 43%|████▎ | 1504/3507 [36:44<45:49, 1.37s/it] {'loss': 0.9017, 'learning_rate': 1.275805074105951e-05, 'epoch': 0.43} 43%|████▎ | 1504/3507 [36:44<45:49, 1.37s/it]tensor([[-3.8906, -0.5391, 2.7500, -1.2812, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0938, -5.6562, -0.7656, 1.7266, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -1.5938, 2.1875, -0.5391, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2812, -0.4277, 2.5000, -2.6094, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0938, -4.2188, -0.6953, 2.3125, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:30,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.44 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.9688, -3.8125, 1.0234, 1.6172, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -3.8438, -0.1904, 2.4531, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.3125, -3.8594, -0.3066, 0.9375, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 
18:21:31,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:21:31,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.37 | bwd_microstep: 194.47 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 193.36 | step_microstep: 1.99 [2025-11-06 18:21:31,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.83 | bwd: 195.41 | bwd_inner: 1.89 | bwd_allreduce: 193.39 | step: 2.08 43%|████▎ | 1505/3507 [36:45<41:09, 1.23s/it] {'loss': 0.1415, 'learning_rate': 1.2749170584325465e-05, 'epoch': 0.43} 43%|████▎ | 1505/3507 [36:45<41:09, 1.23s/it]tensor([[-7.0000, -5.1875, -0.1025, 1.8203, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5312, -5.4375, -1.4688, 1.0234, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5156, -2.9219, -1.6094, 2.2188, -0.6914]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-2.0938, 1.5391, 3.5000, -1.7812, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -4.0312, -0.8398, 1.9297, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -2.9844, 0.0282, 0.9688, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9375, -4.3125, -1.2188, 1.8438, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:33,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.18 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.5000, -3.1406, 0.4824, 2.0625, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:21:33,329] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:21:33,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.88 | bwd_microstep: 1.81 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.74 | step_microstep: 2.08
[2025-11-06 18:21:33,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.07 | bwd: 2.72 | bwd_inner: 1.84 | bwd_allreduce: 0.77 | step: 2.15
43%|████▎ | 1506/3507 [36:47<50:17, 1.51s/it] {'loss': 0.625, 'learning_rate': 1.274028808180677e-05, 'epoch': 0.43}
tensor([[-4.6875, -2.8125, 1.0156, 1.7578, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2500, -2.7656, 0.6367, 1.9141, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.1875, -2.7344, 0.7148, 2.3594, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:21:33,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.01 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.5938, -3.0938, 0.6211, 2.0156, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.9375, -1.1328, 1.8359, -0.8125, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4688, -1.0312, 2.5938, -1.1016, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.4688, -3.2969, -0.0820, 1.4688, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2812, -4.5000, -0.3887, 3.1562, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:21:34,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:21:34,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 283.77 | bwd_microstep: 280.54 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 279.43 | step_microstep: 1.63
[2025-11-06 18:21:34,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 477.81 | bwd: 281.44 | bwd_inner: 1.82 | bwd_allreduce: 279.47 | step: 1.72
43%|████▎ | 1507/3507 [36:47<43:11, 1.30s/it] {'loss': 0.1882, 'learning_rate': 1.2731403241082609e-05, 'epoch': 0.43}
tensor([[-4.4375, -3.4531, -0.2773, 2.1562, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5625, -3.5000, 0.0284, 2.4219, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.3125, -0.9961, 1.4297, 0.3086, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8438, -0.3867, 2.4375, -1.6875, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5312, -2.8438, 0.6055, 1.0234, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.6562, -6.0938, -1.9922, -0.4219, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5000, -0.8359, 1.9766, -2.9219, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:35,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.56 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.7969, -2.3750, 0.5352, 1.2891, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:36,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:21:36,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.74 | bwd_microstep: 1.91 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.89 | step_microstep: 1.65
[2025-11-06 18:21:36,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.32 | bwd: 2.82 | bwd_inner: 1.77 | bwd_allreduce: 0.92 | step: 1.74
43%|████▎ | 1508/3507 [36:49<49:48, 1.50s/it] {'loss': 0.5748, 'learning_rate': 1.2722516069734142e-05, 'epoch': 0.43}
tensor([[-3.4531, 0.0864, 2.2500, -2.6094, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0625, -2.8594, 1.4141, 1.6094, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4375, -4.9375, -2.2500, 3.1719, -1.8359]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:21:36,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.94 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.4062, -3.5625, 0.8594, -0.6719, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.7500, -1.7578, 1.8672, 2.2656, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2812, -0.8828, 3.0000, -0.3047, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.2969, 2.5781, 3.8750, -2.2969, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.7500, -3.5469, -0.5547, 0.7461, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:21:37,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:21:37,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.28 | bwd_microstep: 822.22 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 821.09 | step_microstep: 1.74
[2025-11-06 18:21:37,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 263.23 | bwd: 823.12 | bwd_inner: 1.83 | bwd_allreduce: 821.14 | step: 1.82
43%|████▎ | 1509/3507 [36:51<45:59, 1.38s/it] {'loss': 1.2608, 'learning_rate': 1.2713626575344525e-05, 'epoch': 0.43}
tensor([[-2.7188, 0.2305, 2.5000, -0.8164, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4375, -2.3125, 1.0469, 3.2188, -1.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0312, -1.8984, 2.2656, -0.2354, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0000, -2.9688, 1.8906, 0.4746, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4062, -3.4844, 0.3691, 0.6211, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7812, -3.5312, -1.1094, 1.9375, -1.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:21:37,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.56 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.3750, -1.6797, 1.4609, -0.4219, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9688, -4.0625, -0.2168, 2.9844, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:21:38,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.13 | optimizer_step: 0.17
[2025-11-06 18:21:38,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.63 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.92
[2025-11-06 18:21:38,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 288.20 | bwd: 3.03 | bwd_inner: 2.02 | bwd_allreduce: 0.87 | step: 2.00
43%|████▎ | 1510/3507 [36:52<42:27, 1.28s/it] {'loss': 1.0414, 'learning_rate': 1.2704734765498896e-05, 'epoch': 0.43}
tensor([[-4.1250, -3.5469, -0.0679, 3.4688, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4531, -2.9688, -0.6992, 1.9141, -1.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1562, -2.6094, 0.6641, 1.4219, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:38,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.23 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.6250, -5.2188, -0.4922, 1.8906, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2188, -3.3906, -0.6562, 1.2812, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8750, -0.0942, 3.3750, -1.3828, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.4375, -3.7656, -0.5312, 2.3594, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4062, -2.7969, 1.4375, 0.4355, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:21:39,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.15 | optimizer_step: 0.19
[2025-11-06 18:21:39,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.34 | bwd_microstep: 249.52 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 248.43 | step_microstep: 1.83
[2025-11-06 18:21:39,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 532.59 | bwd: 250.48 | bwd_inner: 1.88 | bwd_allreduce: 248.47 | step: 1.91
43%|████▎ | 1511/3507 [36:52<37:57, 1.14s/it] {'loss': 0.2487, 'learning_rate': 1.2695840647784378e-05, 'epoch': 0.43}
tensor([[-3.7344, -3.8750, -1.7188, 2.2656, -1.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.7812, 2.1406, 2.6250, -1.6016, -1.3828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.0625, 0.7852, 3.4375, -1.7500, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4062, -3.0625, 1.5000, 1.4062, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0312, -1.0312, 1.6953, -1.4219, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.2500, 1.3359, 2.4688, -2.9688, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0625, -0.5156, 2.2188, -2.2031, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:42,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.93 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.7188, -2.3594, 1.1484, -0.0181, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:42,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.79 | optimizer_gradients: 0.16 | optimizer_step: 0.20
[2025-11-06 18:21:42,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.68 | bwd_microstep: 2.28 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 0.92 | step_microstep: 2.95
[2025-11-06 18:21:42,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 477.63 | bwd: 3.24 | bwd_inner: 2.14 | bwd_allreduce: 0.96 | step: 3.04
43%|████▎ | 1512/3507 [36:56<1:02:36, 1.88s/it] {'loss': 0.3326, 'learning_rate': 1.2686944229790044e-05, 'epoch': 0.43}
tensor([[-6.4688, -4.3750, 0.4668, 1.1797, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8750, -3.9219, -1.1406, 3.2344, -1.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.8750, -2.8750, 1.2188, 1.7891, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0938, -3.5938, -0.3789, 2.8281, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:21:42,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.76 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06
tensor([[-2.7344, 0.7461, 2.9688, -1.4688, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.3125, -4.9375, -0.2637, 2.3438, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8906, -1.9453, 1.2344, 1.4297, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-7.3125, -4.2812, 1.0781, -0.1855, -5.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:21:43,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:21:43,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.04 | bwd_microstep: 118.89 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 117.79 | step_microstep: 1.57
[2025-11-06 18:21:43,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.83 | bwd: 119.55 | bwd_inner: 1.61 | bwd_allreduce: 117.82 | step: 1.63
43%|████▎ | 1513/3507 [36:57<48:55, 1.47s/it] {'loss': 0.2756, 'learning_rate': 1.2678045519106948e-05, 'epoch': 0.43}
tensor([[-3.2188, 0.5508, 3.4062, -1.4453, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-4.3438, -0.8164, 2.6250, -1.0781, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2812, -4.0625, -1.2422, 2.3281, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.1250, -3.8125, -0.7812, -1.7812, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2031, -0.4297, 1.9922, -0.7930, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1562, -0.3516, 3.0156, -1.9219, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0938, -0.5000, 2.5156, -1.6797, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:44,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.06 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-4.8438, -3.1406, 0.6094, 1.6719, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:44,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.14 | optimizer_step: 0.20
[2025-11-06 18:21:44,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.92 | bwd_microstep: 2.11 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.94 | step_microstep: 1.77
[2025-11-06 18:21:44,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.00 | bwd: 3.05 | bwd_inner: 1.90 | bwd_allreduce: 0.99 | step: 1.87
43%|████▎ | 1514/3507 [36:58<51:50, 1.56s/it] {'loss': 1.2162, 'learning_rate': 1.2669144523328082e-05, 'epoch': 0.43}
tensor([[-3.2344, -0.0635, 1.9531, -1.8125, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5000, -1.5391, 2.2500, 0.5000, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:45,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.55 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.2188, -2.0781, 2.3438, 0.3379, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7500, -3.3906, 0.3438, 2.1875, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.3438, -4.7812, -0.0488, 1.9688, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5000, -2.7500, 0.0781, 2.4688, -1.8359]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.6250, -1.7578, 2.5938, 0.9531, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6250, -1.0234, 2.8594, -1.1172, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:21:45,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:21:45,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.46 | bwd_microstep: 68.23 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 67.19 | step_microstep: 1.39
[2025-11-06 18:21:45,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 273.04 | bwd: 69.09 | bwd_inner: 1.75 | bwd_allreduce: 67.22 | step: 1.46
43%|████▎ | 1515/3507 [36:59<39:58, 1.20s/it] {'loss': 0.9171, 'learning_rate': 1.2660241250048409e-05, 'epoch': 0.43}
tensor([[-2.9375, -3.1719, -1.2578, 2.5156, -1.0859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8750, 0.8711, 3.3906, -1.6406, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5625, -3.3438, 0.1299, 1.8906, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5938, -3.9375, 0.2812, 1.6719, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.1719, -0.0806, 2.4219, -0.9609, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6094, 1.2812, 3.6875, -1.4219, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-6.4062, -4.5625, -0.4727, 0.4082, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:48,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.49 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.3750, -4.0312, -0.6875, 3.1562, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:21:49,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:21:49,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.26 | bwd_microstep: 1.89 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.80 | step_microstep: 1.96
[2025-11-06 18:21:49,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 410.77 | bwd: 2.93 | bwd_inner: 1.97 | bwd_allreduce: 0.84 | step: 2.04
43%|████▎ | 1516/3507 [37:02<1:04:48, 1.95s/it] {'loss': 0.5528, 'learning_rate': 1.265133570686482e-05, 'epoch': 0.43}
tensor([[-1.8438, 1.6797, 2.8750, -1.7891, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.8438, -4.3125, -2.5938, 1.6250, -1.6641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:21:49,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.65 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.1562, -2.9062, 1.7891, -0.4297, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.9844, 1.0469, 2.6250, -1.2969, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.6562, -1.2344, 1.8984, 0.6133, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.9688, -3.8906, -0.5898, 1.2578, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.1875, -2.3438, 0.8164, 0.8750, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.6250, -5.6875, -1.5938, -0.8750, -5.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:21:49,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:21:49,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.49 | bwd_microstep: 171.84 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 170.85 | step_microstep: 2.65
[2025-11-06 18:21:49,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 272.17 | bwd: 172.62 | bwd_inner: 1.60 | bwd_allreduce: 170.89 | step: 2.74
43%|████▎ | 1517/3507 [37:03<50:04, 1.51s/it] {'loss': 0.5176, 'learning_rate': 1.2642427901376147e-05, 'epoch': 0.43}
tensor([[-6.0938, -4.6250, -0.6133, 0.9102, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-10.6250, -10.0625, -5.8438, -1.9688, -7.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1562, -3.9844, -0.7812, 3.4219, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.0312, -5.7812, -1.7344, 0.3516, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7500, -3.1875, 0.7734, 2.2969, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3125, -1.7578, 2.4375, -0.5781, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.8594, -0.2969, 2.3594, 0.2305, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:51,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.83 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.0312, -1.9531, 0.0688, 3.5156, -0.4277]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:21:51,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:21:51,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.55 | bwd_microstep: 1.83 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.73 | step_microstep: 2.12
[2025-11-06 18:21:51,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.40 | bwd: 2.64 | bwd_inner: 1.74 | bwd_allreduce: 0.77 | step: 2.20
43%|████▎ | 1518/3507 [37:05<52:25, 1.58s/it] {'loss': 0.3011, 'learning_rate': 1.2633517841183151e-05, 'epoch': 0.43}
tensor([[-3.8594, -3.3906, -0.4316, 2.7500, -1.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9062, -1.9453, 1.6562, -0.6523, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:51,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.83 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.6875, -3.7656, 0.5508, 1.3125, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6250, -4.7812, -0.8594, 2.5000, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5000, -2.3750, 0.6992, -0.0471, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6875, -4.6875, -2.1094, 1.8828, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.7734, 1.3359, 2.5156, -1.4531, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5938, -4.4688, -1.5469, 2.5938, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:21:51,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:21:51,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.51 | bwd_microstep: 27.28 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 26.23 | step_microstep: 1.90
[2025-11-06 18:21:51,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.37 | bwd: 28.09 | bwd_inner: 1.72 | bwd_allreduce: 26.25 | step: 1.96
43%|████▎ | 1519/3507 [37:05<40:30, 1.22s/it] {'loss': 0.6916, 'learning_rate': 1.2624605533888526e-05, 'epoch': 0.43}
tensor([[-3.1406, -2.3125, 0.9648, 3.7031, -1.3828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.9922, 0.0679, 3.0156, 2.8750, -1.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7188, -0.5117, 3.1719, 0.4785, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.8008, 0.1641, 2.1562, 3.4375, 0.0488]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6562, -3.7969, -1.3203, 2.7656, -1.6172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.2812, 1.1328, 3.0000, -1.2578, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-5.6562, -2.4062, 1.1016, -1.8438, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:21:55,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.01 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.3438, -4.2188, 0.6328, 1.2422, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:21:55,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:21:55,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.19 | bwd_microstep: 1.93 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.86 | step_microstep: 124.58
[2025-11-06 18:21:55,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 578.22 | bwd: 2.75 | bwd_inner: 1.74 | bwd_allreduce: 0.89 | step: 124.66
43%|████▎ | 1520/3507 [37:09<1:08:46, 2.08s/it] {'loss': 0.8515, 'learning_rate': 1.2615690987096866e-05, 'epoch': 0.43}
tensor([[-4.6875, -2.3438, 1.4844, 0.9297, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5625, -0.9648, 3.2812, -0.1533, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.6406, -4.0312, -1.8594, 2.5312, -1.4609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.7188, -4.8125, 0.0212, 1.2500, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:21:55,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.25 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.5938, -3.7656, 0.6328, 2.0469, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.9219, -3.9688, -1.5000, 2.4062, -1.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7812, -2.7500, 1.3047, 1.8672, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.9062, -2.1250, 0.6211, 2.9688, -1.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:21:56,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.21 | optimizer_step: 0.18
[2025-11-06 18:21:56,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.88 | bwd_microstep: 2.13 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.92 | step_microstep: 2.10
[2025-11-06 18:21:56,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.17 | bwd: 2.98 | bwd_inner: 1.88 | bwd_allreduce: 0.96 | step: 2.17
43%|████▎ | 1521/3507 [37:09<52:32, 1.59s/it] {'loss': 0.2761, 'learning_rate': 1.2606774208414694e-05, 'epoch': 0.43}
tensor([[-6.4375, -6.3438, -3.1406, 0.9648, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3125, -4.6250, -1.9844, 2.7500, -1.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7500, -0.3887, 2.2500, -1.6562, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.1875, -1.9297, 2.3906, -0.1738, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:56,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.06 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.4375, -4.3750, -0.5430, 1.8281, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2188, -3.7812, -0.8516, 2.1875, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9688, -4.1875, -0.6758, 2.3906, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.7500, -6.9062, -3.9844, 0.3965, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:21:56,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:21:56,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 78.64 | bwd_microstep: 53.62 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 52.48 | step_microstep: 1.72
[2025-11-06 18:21:56,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 304.73 | bwd: 54.42 | bwd_inner: 1.77 | bwd_allreduce: 52.52 | step: 1.79
43%|████▎ | 1522/3507 [37:10<40:38, 1.23s/it] {'loss': 0.0578, 'learning_rate': 1.2597855205450427e-05, 'epoch': 0.43}
tensor([[-4.1875, -0.0688, 2.9375, -2.7969, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.7656, -2.3438, 0.5117, 3.8594, -0.9961]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4062, -2.6094, 0.6719, 0.8750, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8125, -1.2266, 0.7852, -4.0000, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:21:56,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.40 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.9062, -4.0312, 0.5430, 1.6406, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6250, -3.4375, -0.4453, 3.6562, -1.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2812, -3.4062, 0.2695, 3.2812, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.9062, -5.1875, -0.4785, 1.0156, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:21:56,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.17 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 18:21:56,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.69 | bwd_microstep: 1.73 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.65 | step_microstep: 2.82
[2025-11-06 18:21:56,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.12 | bwd: 2.74 | bwd_inner: 1.93 | bwd_allreduce: 0.68 | step: 2.90
43%|████▎ | 1523/3507 [37:10<32:16, 1.02it/s] {'loss': 0.432, 'learning_rate': 1.2588933985814377e-05, 'epoch': 0.43}
tensor([[-2.9531, -3.5625, -2.2188, 1.9141, -0.9883]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5938, -2.8906, 1.3984, 0.2754, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4062, -3.2969, -0.0349, 1.7344, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.1875, -1.2891, 2.0312, -0.4062, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3438, -3.8750, -0.2490, 0.8320, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.0000, -4.0312, 0.7852, 1.7266, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8438, -0.0212, 2.6875, -2.3594, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:21:59,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 103.20 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.7188, -3.6719, 0.8594, 1.6953, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:00,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:22:00,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.56 | bwd_microstep: 1.95 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.45
[2025-11-06 18:22:00,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.77 | bwd: 2.79 | bwd_inner: 1.77 | bwd_allreduce: 0.88 | step: 2.53
43%|████▎ | 1524/3507 [37:13<53:46, 1.63s/it] {'loss': 0.2192, 'learning_rate': 1.258001055711876e-05, 'epoch': 0.43}
tensor([[-4.1875, -3.8750, -0.8242, 2.6250, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.9531, 0.7969, 2.9688, -2.2969, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1250, -3.5469, 0.4258, 1.8750, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.1250, -0.6719, 2.0625, 0.7930, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:00,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.56 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.9062, -5.7500, -1.6484, 0.6523, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.7188, 1.5703, 3.1562, -0.8555, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-6.0312, -5.1250, -1.1406, 1.5078, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8750, -3.3750, 0.2207, 1.6484, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:00,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:22:00,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.28 | bwd_microstep: 1.70 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.00
[2025-11-06 18:22:00,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.88 | bwd: 2.52 | bwd_inner: 1.53 | bwd_allreduce: 0.83 | step: 2.09
43%|████▎ | 1525/3507 [37:14<41:49, 1.27s/it] {'loss': 0.6918, 'learning_rate': 1.2571084926977669e-05, 'epoch': 0.43}
tensor([[-2.1875, 1.0859, 2.4844, -1.9219, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5625, -3.7188, 0.8672, 2.4375, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4062, -4.3750, -1.4062, 2.6719, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3125, -2.8281, 0.4316, 1.8516, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7812, -2.8906, 0.7461, 3.5781, -1.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7656, -1.1016, 1.9453, 0.2793, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.3281, -0.8516, 2.5625, 1.5156, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:02,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.73 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-3.7031, -3.6094, -1.0156, 2.7656, -1.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:02,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:22:02,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.50 | bwd_microstep: 2.15 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.53
[2025-11-06 18:22:02,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.25 | bwd: 3.07 | bwd_inner: 1.98 | bwd_allreduce: 0.93 | step: 2.63
44%|████▎ | 1526/3507 [37:16<50:35, 1.53s/it] {'loss': 0.1687, 'learning_rate': 1.2562157103007069e-05, 'epoch': 0.44}
tensor([[-5.8125, -2.8750, 1.0078, -0.8672, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.0781, 1.8438, 3.6406, -2.1562, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-4.3125, -3.5000, -0.5508, 1.3281, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:02,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.09 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.9062, -0.7773, 3.2188, -1.7578, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.6719, 0.6836, 1.8516, -0.9688, -1.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4844, -3.2969, -0.7070, 2.8906, -1.5078]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-2.8750, 1.0234, 2.6406, -3.0781, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-3.5781, 0.1885, 3.7188, -0.5664, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:03,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.18 | optimizer_step: 0.20
[2025-11-06 18:22:03,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.08 | bwd_microstep: 1.90 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.66
[2025-11-06 18:22:03,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 450.19 | bwd: 2.73 | bwd_inner: 1.69 | bwd_allreduce: 0.89 | step: 1.75
44%|████▎ | 1527/3507 [37:16<40:20, 1.22s/it] {'loss': 1.3548, 'learning_rate': 1.2553227092824812e-05, 'epoch': 0.44}
tensor([[-1.2578, 2.2969, 3.4062, -1.6719, -1.9141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1],
device='cuda:2') tensor([[-3.9531, -2.4531, 0.5898, 1.5234, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8594, 1.2734, 2.0000, -2.1875, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.6719, -0.0972, 1.8594, -0.4258, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.3125, -4.8750, -0.7617, 0.8281, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-9.0625, -5.6875, -1.6016, -4.5625, -7.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3750, -2.4219, 1.0625, 1.1094, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:22:05,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.64 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.15 tensor([[-3.6094, -0.0109, 3.1875, -0.9570, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:22:06,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.86 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:22:06,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 295.78 | bwd_microstep: 1.98 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.81 | step_microstep: 3.13 [2025-11-06 18:22:06,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 497.42 | bwd: 2.77 | bwd_inner: 1.81 | bwd_allreduce: 0.84 | step: 3.28 44%|████▎ | 1528/3507 [37:19<57:00, 1.73s/it] {'loss': 0.914, 'learning_rate': 1.25442949040506e-05, 'epoch': 0.44} 44%|████▎ | 1528/3507 [37:19<57:00, 1.73s/it]tensor([[-5.5938, -3.9844, 0.5039, 2.4062, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6875, -1.6562, 
1.8750, -0.4961, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:22:06,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.96 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.9062, -0.8672, 2.5156, -0.2314, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.0625, -4.3438, 0.7305, 0.0439, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4375, -1.3672, 2.2812, -0.2383, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.7812, -5.8438, -1.4453, 1.8125, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6562, -3.0312, -0.0186, 2.8594, -1.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7969, -0.5195, 2.8125, -0.5586, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:22:06,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:22:06,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 68.32 | bwd_microstep: 418.22 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 416.94 | step_microstep: 1.50 [2025-11-06 18:22:06,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 211.31 | bwd: 419.11 | bwd_inner: 1.96 | bwd_allreduce: 416.98 | step: 1.59 44%|████▎ | 1529/3507 [37:20<46:24, 1.41s/it] {'loss': 0.2182, 'learning_rate': 1.2535360544306007e-05, 'epoch': 0.44} 44%|████▎ | 1529/3507 [37:20<46:24, 1.41s/it]tensor([[-7.4688, -6.1250, -1.7109, 0.5742, -5.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6562, -2.6719, 1.7734, -0.1377, -4.5312]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4688, -5.0625, -3.0781, 1.6406, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.8438, -4.7188, -2.3281, 1.0391, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6562, -1.5156, 1.6484, 0.8398, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6875, -3.9375, 0.0928, 1.2266, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.7812, -4.9688, -0.2256, 1.2812, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:22:07,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.77 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-2.9531, 0.1719, 2.6875, -1.0391, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:22:07,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:22:07,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.74 | bwd_microstep: 1.92 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.14 [2025-11-06 18:22:07,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.53 | bwd: 2.98 | bwd_inner: 1.90 | bwd_allreduce: 0.91 | step: 2.25 44%|████▎ | 1530/3507 [37:21<43:13, 1.31s/it] {'loss': 0.636, 'learning_rate': 1.2526424021214452e-05, 'epoch': 0.44} 44%|████▎ | 1530/3507 [37:21<43:13, 1.31s/it]tensor([[-3.2656, -2.2969, 0.3340, 2.0000, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:22:07,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 113.95 | 
bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.2188, -2.2031, 1.7969, 2.1250, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.5625, 1.7031, 3.0469, -1.3125, -1.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -3.8125, 0.0239, 2.8125, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6875, -1.3203, 0.9492, -0.3008, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -1.7031, 1.2812, 0.4766, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2188, -0.3906, 2.6719, -1.9453, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4062, -3.4375, 0.2383, 2.9844, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:22:08,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.19 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:22:08,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.00 | bwd_microstep: 274.32 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 273.25 | step_microstep: 3.32 [2025-11-06 18:22:08,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.97 | bwd: 275.34 | bwd_inner: 1.91 | bwd_allreduce: 273.30 | step: 3.42 44%|████▎ | 1531/3507 [37:22<36:32, 1.11s/it] {'loss': 0.637, 'learning_rate': 1.2517485342401201e-05, 'epoch': 0.44} 44%|████▎ | 1531/3507 [37:22<36:32, 1.11s/it]tensor([[-5.1875, -4.6562, -1.2031, 2.1250, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6406, -2.3281, 0.8984, 2.3438, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') 
tensor([[-5.5625, -3.4375, 1.1641, 1.4922, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9062, -3.5312, 0.0201, 1.8281, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1562, -1.8125, 1.0156, 2.2812, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0000, -3.9688, 0.1514, 0.4473, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:22:09,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.44 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.8750, -4.8438, -0.6641, 2.0469, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1250, -3.8281, 0.8711, 1.2500, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:22:09,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.23 | optimizer_step: 0.27 [2025-11-06 18:22:09,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.22 | bwd_microstep: 1.71 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.77 | step_microstep: 32.43 [2025-11-06 18:22:09,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.67 | bwd: 2.69 | bwd_inner: 1.77 | bwd_allreduce: 0.80 | step: 32.51 44%|████▎ | 1532/3507 [37:23<40:40, 1.24s/it] {'loss': 0.568, 'learning_rate': 1.2508544515493356e-05, 'epoch': 0.44} 44%|████▎ | 1532/3507 [37:23<40:40, 1.24s/it]tensor([[-3.1875, -3.5938, -2.0625, 1.6484, -1.2734]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-5.9688, -2.7344, 2.2188, 0.2158, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:22:10,271] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.72 | bwd_microstep: 1.15 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-6.2500, -3.2031, 1.4375, -0.2197, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6562, -0.6484, 2.8438, 0.3379, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4844, -1.5859, 1.3672, 1.2969, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6094, -0.2812, 3.3281, 0.1611, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2812, -2.7344, 1.3047, 3.1094, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -3.1719, 0.5508, 1.6406, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:22:12,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:22:12,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 69.22 | bwd_microstep: 1757.42 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 1756.10 | step_microstep: 2.14 [2025-11-06 18:22:12,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.96 | bwd: 1758.57 | bwd_inner: 2.25 | bwd_allreduce: 1756.16 | step: 2.24 44%|████▎ | 1533/3507 [37:25<49:29, 1.50s/it] {'loss': 0.5932, 'learning_rate': 1.2499601548119868e-05, 'epoch': 0.44} 44%|████▎ | 1533/3507 [37:25<49:29, 1.50s/it]tensor([[-4.3125, -4.3750, -1.7344, 2.3750, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.8438, 1.7812, 2.7969, -2.7812, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.1250, -2.1250, 1.2734, -1.2188, -4.3125]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.0469, 1.1484, 2.1250, -2.0625, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:22:12,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.64 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.6875, -2.7969, 0.7266, 0.8516, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.7812, -0.9727, 2.8906, -1.0938, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1875, -3.4062, -0.3027, 2.4219, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9062, -2.8281, 0.8672, 0.8750, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:22:12,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:22:12,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 81.97 | bwd_microstep: 101.37 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 100.52 | step_microstep: 1.98 [2025-11-06 18:22:12,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 283.63 | bwd: 102.05 | bwd_inner: 1.35 | bwd_allreduce: 100.56 | step: 2.07 44%|████▎ | 1534/3507 [37:26<38:44, 1.18s/it] {'loss': 0.4308, 'learning_rate': 1.2490656447911489e-05, 'epoch': 0.44} 44%|████▎ | 1534/3507 [37:26<38:44, 1.18s/it]tensor([[-3.8281, -0.2236, 2.7344, -1.5625, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5156, -3.5312, -1.7656, 1.3906, -1.6797]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9062, -2.5312, 2.0781, -0.5938, -4.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:0') [2025-11-06 18:22:12,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.63 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-0.6875, 2.2344, 2.0156, -2.1094, -1.3359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.5000, -3.6094, 0.0635, 3.1562, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.7500, -3.3750, 1.5547, -0.7930, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.8438, -6.2812, -1.9688, -0.4453, -5.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1250, -3.7812, -1.8906, 2.7656, -0.9883]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:22:16,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.23 | optimizer_step: 0.33 [2025-11-06 18:22:16,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.10 | bwd_microstep: 3089.08 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 3087.93 | step_microstep: 2.65 [2025-11-06 18:22:16,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.76 | bwd: 3090.00 | bwd_inner: 1.86 | bwd_allreduce: 3087.98 | step: 2.73 44%|████▍ | 1535/3507 [37:29<1:01:54, 1.88s/it] {'loss': 0.1591, 'learning_rate': 1.2481709222500813e-05, 'epoch': 0.44} 44%|████▍ | 1535/3507 [37:29<1:01:54, 1.88s/it]tensor([[-4.4062, -1.6250, 2.2500, 0.6797, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2500, -2.4844, 1.1484, -0.5898, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7500, -2.7031, 0.9570, 1.1641, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-4.9688, -3.5312, 0.5508, 2.5781, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:22:16,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.62 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-5.4688, -3.5469, 1.0469, 2.0312, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8438, -3.4531, -0.6289, 2.5156, -1.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1250, -4.0312, -1.2031, 2.6250, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5938, -1.1094, 3.0000, 0.1758, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:22:16,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:22:16,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.99 | bwd_microstep: 100.60 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 99.48 | step_microstep: 1.74 [2025-11-06 18:22:16,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.64 | bwd: 101.57 | bwd_inner: 1.88 | bwd_allreduce: 99.53 | step: 1.83 44%|████▍ | 1536/3507 [37:30<48:08, 1.47s/it] {'loss': 0.2038, 'learning_rate': 1.2472759879522234e-05, 'epoch': 0.44} 44%|████▍ | 1536/3507 [37:30<48:08, 1.47s/it]tensor([[-3.4375, -3.8906, -1.9375, 2.2188, -1.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:22:16,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.26 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.5312, -4.3438, -0.2598, 2.2500, -3.4219]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5625, -3.8125, 1.3125, 0.6758, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5312, -1.7891, 1.4531, -0.4941, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7188, -2.1875, 1.6406, 0.5859, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.4141, 2.1875, 3.0625, -2.4219, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4062, -2.6094, 1.4375, -0.0364, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.4141, 0.6680, 3.5156, 3.2812, -0.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:22:20,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 18:22:20,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.32 | bwd_microstep: 3214.41 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 3213.28 | step_microstep: 2.18 [2025-11-06 18:22:20,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 401.61 | bwd: 3215.10 | bwd_inner: 1.63 | bwd_allreduce: 3213.32 | step: 2.26 44%|████▍ | 1537/3507 [37:34<1:09:43, 2.12s/it] {'loss': 0.2833, 'learning_rate': 1.2463808426611958e-05, 'epoch': 0.44} 44%|████▍ | 1537/3507 [37:34<1:09:43, 2.12s/it]tensor([[-4.6875, -5.1562, -2.8906, 1.6641, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5000, -6.2812, -2.7344, 1.0547, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:22:20,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.03 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | 
step_microstep: 0.08 tensor([[-4.7812, -3.6875, -0.1040, 2.0625, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -2.7031, 1.5859, 1.6953, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2812, -4.2188, -1.5078, 2.3750, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4219, -3.4219, -0.9102, 2.9531, -1.3984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6562, -3.9531, -1.2266, 3.4688, -1.3672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.8750, 1.6328, 3.8750, -0.5469, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:22:20,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:22:20,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.16 | bwd_microstep: 243.32 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 242.45 | step_microstep: 1.53 [2025-11-06 18:22:20,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.22 | bwd: 244.27 | bwd_inner: 1.65 | bwd_allreduce: 242.49 | step: 1.61 44%|████▍ | 1538/3507 [37:34<55:04, 1.68s/it] {'loss': 0.143, 'learning_rate': 1.2454854871407993e-05, 'epoch': 0.44} 44%|████▍ | 1538/3507 [37:34<55:04, 1.68s/it]tensor([[-3.5625, -3.8125, -1.7500, 2.1562, -1.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6562, -3.1406, 0.6562, -0.3730, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6875, -4.1562, -0.0544, 1.5938, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:22:21,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 201.43 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.9375, -3.4375, -0.2832, 2.9531, -1.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.3125, -6.1562, -1.8750, 0.4766, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.3438, -5.7500, -1.4297, 0.0223, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9688, -3.3906, -0.3184, 2.6094, -2.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -3.7188, 0.1206, 2.5625, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:22:23,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:22:23,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.19 | bwd_microstep: 1878.38 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1877.31 | step_microstep: 1.85 [2025-11-06 18:22:23,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.65 | bwd: 1879.31 | bwd_inner: 1.84 | bwd_allreduce: 1877.35 | step: 1.92 44%|████▍ | 1539/3507 [37:36<1:01:05, 1.86s/it] {'loss': 0.2639, 'learning_rate': 1.2445899221550137e-05, 'epoch': 0.44} 44%|████▍ | 1539/3507 [37:36<1:01:05, 1.86s/it]tensor([[-5.0625, -2.7812, 1.0391, 0.6289, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9844, -0.3496, 2.5156, 0.7344, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.3750, -5.6875, -1.2344, 2.6094, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:22:23,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.30 | 
bwd_microstep: 0.99 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.8750, -2.3125, 1.7578, 0.7539, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -3.0156, 0.0381, 1.5781, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -2.5781, 0.9961, 0.9414, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -2.9844, 0.4902, 2.2969, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8750, -1.0312, 1.8594, -0.6289, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:22:23,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 18:22:23,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.03 | bwd_microstep: 35.41 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 34.23 | step_microstep: 2.21 [2025-11-06 18:22:23,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 409.36 | bwd: 36.39 | bwd_inner: 1.96 | bwd_allreduce: 34.27 | step: 2.30 44%|████▍ | 1540/3507 [37:37<47:32, 1.45s/it] {'loss': 0.2873, 'learning_rate': 1.2436941484679974e-05, 'epoch': 0.44} 44%|████▍ | 1540/3507 [37:37<47:32, 1.45s/it]tensor([[-3.7969, -3.0781, -0.0237, 2.5781, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5625, -3.0625, 0.6133, -0.6914, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2812, -4.4375, -1.5000, 3.0156, -1.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:22:23,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.92 | bwd_microstep: 0.85 | bwd_inner_microstep: 
0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.7812, -2.1562, 1.9062, 0.5156, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.3125, -3.7812, 1.1562, -1.6719, -6.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.0000, -3.8281, 1.0547, 1.4609, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.9688, -4.1562, -1.0703, 1.2500, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.7031, 0.6523, 3.1719, -0.3555, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:22:24,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.21 | optimizer_step: 0.30
[2025-11-06 18:22:24,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.15 | bwd_microstep: 173.07 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 172.03 | step_microstep: 2.27
[2025-11-06 18:22:24,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.08 | bwd: 173.92 | bwd_inner: 1.65 | bwd_allreduce: 172.09 | step: 2.37
44%|████▍ | 1541/3507 [37:38<39:04, 1.19s/it] {'loss': 0.2276, 'learning_rate': 1.242798166844088e-05, 'epoch': 0.44}
tensor([[-6.3438, -6.0312, -2.7188, 0.9688, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.8047, -1.3281, -1.1562, 1.6016, 0.4297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-4.5000, -2.9688, -0.1299, 0.2480, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:24,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.72 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.9688, -3.3594, 0.6328, 1.8281, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.0625, -4.5000, -0.3516, 1.0859, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.0469, -0.3477, 1.9531, -0.4355, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4219, -3.5938, -2.1250, 1.0391, -1.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3438, -3.7031, -0.4785, 2.3906, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:24,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:22:24,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.86 | bwd_microstep: 1.75 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.28
[2025-11-06 18:22:24,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.58 | bwd: 2.68 | bwd_inner: 1.67 | bwd_allreduce: 0.86 | step: 2.38
44%|████▍ | 1542/3507 [37:38<34:02, 1.04s/it] {'loss': 0.5802, 'learning_rate': 1.2419019780477985e-05, 'epoch': 0.44}
tensor([[-5.7812, -3.1719, 0.4785, -0.7852, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0938, -2.4219, 1.0625, -0.4453, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.8438, -5.4375, -2.0312, 1.2969, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:25,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.70 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.1875, -4.6875, -1.4844, 1.4609, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4062, -0.1177, 2.3281, -1.4219, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.3594, -2.7344, 0.2832, 3.1406, -1.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0625, -1.4766, 1.6250, -0.3477, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.1250, -3.5156, 1.5391, 1.1484, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:22:27,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.16 | optimizer_step: 0.20
[2025-11-06 18:22:27,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 105.29 | bwd_microstep: 2311.31 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 2310.02 | step_microstep: 1.92
[2025-11-06 18:22:27,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.02 | bwd: 2312.00 | bwd_inner: 1.76 | bwd_allreduce: 2310.07 | step: 2.00
44%|████▍ | 1543/3507 [37:41<49:41, 1.52s/it] {'loss': 0.376, 'learning_rate': 1.241005582843821e-05, 'epoch': 0.44}
tensor([[-4.1250, -0.4922, 2.5156, -1.7188, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:27,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.25 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.3438, -2.8281, -0.0209, 0.5039, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.7031, 1.5469, 3.2344, -0.1660, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9219, -2.5000, 1.0312, 2.4062, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.6250, 0.8477, 2.9688, -1.6016, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1250, -0.6094, 2.7344, -1.0859, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4062, -0.4004, 1.4844, -1.6328, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.2500, -3.5000, -1.2578, 2.8594, -1.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:29,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:22:29,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.92 | bwd_microstep: 1.88 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.79 | step_microstep: 1.94
[2025-11-06 18:22:29,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 269.19 | bwd: 2.79 | bwd_inner: 1.83 | bwd_allreduce: 0.82 | step: 2.02
44%|████▍ | 1544/3507 [37:43<51:22, 1.57s/it] {'loss': 0.4817, 'learning_rate': 1.240108981997022e-05, 'epoch': 0.44}
tensor([[-4.1562, -2.1562, 1.2812, 1.3516, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7188, -3.9844, -1.6406, 2.5156, -1.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2500, -3.2969, 0.1309, 2.4219, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:29,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.93 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.9258, 1.5469, 2.1875, -0.8867, -1.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.8750, -2.4844, 1.4531, -1.8984, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.0000, -5.1875, -1.1250, 1.9219, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.7188, -4.2812, -0.1484, 1.6094, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0000, -3.7344, 0.0598, 2.2969, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:22:30,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.80 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:22:30,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.39 | bwd_microstep: 449.33 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 448.15 | step_microstep: 2.35
[2025-11-06 18:22:30,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 473.34 | bwd: 450.39 | bwd_inner: 2.07 | bwd_allreduce: 448.20 | step: 2.44
44%|████▍ | 1545/3507 [37:44<45:24, 1.39s/it] {'loss': 0.3023, 'learning_rate': 1.2392121762724443e-05, 'epoch': 0.44}
tensor([[-4.7188, -2.4062, 1.4922, 1.2656, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:30,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 100.40 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-5.1250, -4.7812, -1.6719, 1.5703, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5000, -5.1250, -1.5859, 2.0625, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.5625, -5.9688, -2.2500, 1.0859, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4688, -3.3750, 0.8906, 1.2422, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-2.1250, 1.2188, 2.6562, -1.7891, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([3], device='cuda:1')
tensor([[-4.6875, -3.5938, -0.3086, 1.7188, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.0625, -4.8750, -0.5156, -0.5586, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:32,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.80 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 18:22:32,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.35 | bwd_microstep: 2.97 | bwd_inner_microstep: 2.08 | bwd_allreduce_microstep: 0.81 | step_microstep: 2.50
[2025-11-06 18:22:32,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.75 | bwd: 3.87 | bwd_inner: 2.90 | bwd_allreduce: 0.85 | step: 2.59
44%|████▍ | 1546/3507 [37:46<53:20, 1.63s/it] {'loss': 0.4989, 'learning_rate': 1.2383151664353048e-05, 'epoch': 0.44}
tensor([[-4.4688, -0.8711, 2.4375, -1.3438, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.7500, -2.3125, 2.2969, -0.1797, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.5625, -3.0156, 1.3047, 0.1758, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.5625, -7.0312, -3.1094, 0.3477, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:32,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.57 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.7656, -3.9531, -2.3125, 0.8906, -1.8203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-2.8125, 0.3848, 2.4688, -1.1875, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-7.3750, -6.2500, -1.8203, 0.6211, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.0625, -4.6562, 0.3457, 0.3848, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:22:32,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:22:32,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.56 | bwd_microstep: 55.38 | bwd_inner_microstep: 1.70 | bwd_allreduce_microstep: 53.60 | step_microstep: 1.46
[2025-11-06 18:22:32,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.15 | bwd: 56.15 | bwd_inner: 2.39 | bwd_allreduce: 53.63 | step: 1.53
44%|████▍ | 1547/3507 [37:46<41:21, 1.27s/it] {'loss': 0.7939, 'learning_rate': 1.2374179532509958e-05, 'epoch': 0.44}
tensor([[-6.0312, -3.5781, -0.6719, -2.0938, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1562, -4.4375, -0.8984, 2.1875, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:32,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.21 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.8750, -3.2812, 0.5938, 1.6641, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.9375, -0.5469, 1.0391, -1.0312, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1562, -1.7812, 1.5547, 0.4531, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.5312, -1.2266, 2.5625, -0.6016, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1875, -2.7969, 0.4805, 1.5625, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.5781, 0.4453, 2.9062, -0.5547, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:35,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.19 | optimizer_step: 0.17
[2025-11-06 18:22:35,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.16 | bwd_microstep: 1.93 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.81 | step_microstep: 92.03
[2025-11-06 18:22:35,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.38 | bwd: 2.74 | bwd_inner: 1.78 | bwd_allreduce: 0.84 | step: 92.11
44%|████▍ | 1548/3507 [37:49<52:57, 1.62s/it] {'loss': 0.6385, 'learning_rate': 1.2365205374850814e-05, 'epoch': 0.44}
tensor([[-6.1250, -6.0938, -2.9531, 1.0391, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0938, -2.5469, 1.3906, 0.4766, -3.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1875, -4.1562, -1.1953, 2.9375, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8750, -1.3594, 2.7344, -0.7031, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:35,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.16 | bwd_microstep: 4.18 | bwd_inner_microstep: 4.05 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.0938, -0.5586, 2.1875, -1.8984, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1562, -2.9219, 0.4023, 1.8594, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6875, -1.8594, 1.9062, 0.1680, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -2.5156, 1.0312, 0.6484, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:35,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.16 | optimizer_step: 0.19
[2025-11-06 18:22:35,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 110.09 | bwd_microstep: 1.75 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 2.16
[2025-11-06 18:22:35,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.28 | bwd: 5.92 | bwd_inner: 4.99 | bwd_allreduce: 0.78 | step: 2.24
44%|████▍ | 1549/3507 [37:49<41:12, 1.26s/it] {'loss': 0.4047, 'learning_rate': 1.2356229199033008e-05, 'epoch': 0.44}
tensor([[-5.5938, -2.5312, 1.8125, 0.0320, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0000, -2.0938, 1.6562, -0.1357, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:35,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.91 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.3125, -1.7188, 2.8125, -0.3887, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9688, -4.5938, -1.3438, 2.0625, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.4062, -4.4062, 0.3438, 1.3984, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.4375, -5.4062, -1.6016, 0.7773, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.1328, 2.2031, 3.5625, -1.2578, -1.7578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.5469, 0.8164, 2.5156, -2.0312, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:22:37,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.73 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:22:37,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.89 | bwd_microstep: 1.90 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.65
[2025-11-06 18:22:37,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.83 | bwd: 2.63 | bwd_inner: 1.60 | bwd_allreduce: 0.88 | step: 2.75
44%|████▍ | 1550/3507 [37:50<41:47, 1.28s/it] {'loss': 0.7271, 'learning_rate': 1.2347251012715629e-05, 'epoch': 0.44}
tensor([[-3.9219, -3.3750, -0.6953, 1.8906, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -3.1406, -0.3203, -0.0471, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0000, -4.2812, -0.0864, 0.9688, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.4062, -4.6250, 0.4062, 2.1250, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:37,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.60 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-7.2188, -6.6875, -2.2344, 1.9375, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.6875, 2.1250, 2.8125, -2.7969, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.3594, 0.9883, 3.3750, -0.1494, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.5312, 1.5859, 3.7969, -1.9297, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:22:37,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.29 | optimizer_step: 0.26
[2025-11-06 18:22:37,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 84.68 | bwd_microstep: 33.08 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 32.07 | step_microstep: 2.79
[2025-11-06 18:22:37,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.30 | bwd: 34.07 | bwd_inner: 1.79 | bwd_allreduce: 32.12 | step: 2.91
44%|████▍ | 1551/3507 [37:51<33:03, 1.01s/it] {'loss': 0.2322, 'learning_rate': 1.2338270823559497e-05, 'epoch': 0.44}
tensor([[-5.1562, -4.0312, -0.2412, 2.0625, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.7500, -5.4062, -1.8281, 1.6953, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.2500, -2.8281, 1.8047, -0.7891, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:37,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.86 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14
tensor([[-1.2344, 2.2344, 2.5625, -2.7656, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.2812, -1.4375, 2.1719, 0.2451, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5156, -0.6797, 1.0703, -1.8438, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.6250, 0.9805, 3.7969, -0.3984, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0938, -2.3438, 1.7734, 0.4629, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:38,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.16 | optimizer_step: 0.19
[2025-11-06 18:22:38,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.62 | bwd_microstep: 1.68 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 1.87
[2025-11-06 18:22:38,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 552.52 | bwd: 2.72 | bwd_inner: 1.70 | bwd_allreduce: 0.85 | step: 2.02
44%|████▍ | 1552/3507 [37:52<38:46, 1.19s/it] {'loss': 0.5527, 'learning_rate': 1.2329288639227142e-05, 'epoch': 0.44}
tensor([[-0.6602, 2.2344, 2.3750, -1.9219, -1.3672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-4.4375, -2.6719, 1.1562, 1.8750, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.1562, -5.5938, -1.2734, 0.4375, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2812, -1.3438, 2.5781, 0.8164, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.6328, -0.0302, 2.7812, 3.0000, -0.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.9219, 0.1436, 2.2188, -1.0547, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8750, -4.7188, -1.6562, 2.0469, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:42,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.62 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.25
tensor([[-5.1562, -4.0000, -0.4727, 1.1406, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:42,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 18:22:42,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 319.52 | bwd_microstep: 2.00 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.88 | step_microstep: 3.24
[2025-11-06 18:22:42,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 470.14 | bwd: 3.10 | bwd_inner: 2.01 | bwd_allreduce: 0.93 | step: 3.49
44%|████▍ | 1553/3507 [37:56<1:01:32, 1.89s/it] {'loss': 0.3253, 'learning_rate': 1.2320304467382786e-05, 'epoch': 0.44}
tensor([[-1.7344, 1.1797, 2.5312, -0.7031, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.6875, -3.6250, 0.3750, 0.5039, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:42,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.58 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.8906, -0.4141, 2.8125, -0.6367, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5625, -2.5312, 0.2559, 1.7266, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8750, -4.4062, -2.2812, 2.1875, -1.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.1875, -2.9062, 0.0566, 1.0234, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.3750, -3.5156, 1.5078, 0.5742, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4219, 0.4258, 3.0469, -1.6016, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:22:42,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:22:42,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.80 | bwd_microstep: 74.21 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 73.13 | step_microstep: 1.85
[2025-11-06 18:22:42,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.41 | bwd: 75.06 | bwd_inner: 1.76 | bwd_allreduce: 73.17 | step: 1.93
44%|████▍ | 1554/3507 [37:56<47:08, 1.45s/it] {'loss': 0.4253, 'learning_rate': 1.2311318315692355e-05, 'epoch': 0.44}
tensor([[-4.7812, -4.4375, -0.9688, 2.8906, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7188, -2.5625, 1.0312, 0.8164, -3.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8281, -1.2656, 1.8281, 0.6406, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:43,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.54 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-1.9844, 0.4199, 2.4531, 0.0601, -1.8203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.9062, -1.9844, 1.8125, -0.3438, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.5312, -3.3125, -0.8750, 2.4062, -1.6484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9688, -3.1094, 0.1206, 2.7344, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3438, -2.8281, 1.3281, 0.6211, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:22:43,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.40 | optimizer_step: 0.30
[2025-11-06 18:22:43,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 76.47 | bwd_microstep: 220.21 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 219.21 | step_microstep: 2.62
[2025-11-06 18:22:43,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 278.03 | bwd: 221.01 | bwd_inner: 1.57 | bwd_allreduce: 219.28 | step: 2.68
44%|████▍ | 1555/3507 [37:57<38:13, 1.17s/it] {'loss': 0.2324, 'learning_rate': 1.2302330191823467e-05, 'epoch': 0.44}
tensor([[-5.5938, -3.3125, 0.8672, 0.7539, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5312, -2.2188, 1.5781, 1.2734, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:43,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.86 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.0000, -3.8281, 0.2754, 0.3145, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1250, -3.5312, 0.6953, 2.2031, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-8.1250, -6.7812, -2.5781, -0.5039, -5.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1875, -3.4531, -0.1123, 2.7031, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3750, -3.5469, 0.1602, 3.0156, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2656, 0.2969, 2.1094, -2.7500, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:45,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.21 | optimizer_step: 0.18
[2025-11-06 18:22:45,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.01 | bwd_microstep: 2.06 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.98 | step_microstep: 2.28
[2025-11-06 18:22:45,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 278.82 | bwd: 2.73 | bwd_inner: 1.54 | bwd_allreduce: 1.02 | step: 2.37
44%|████▍ | 1556/3507 [37:59<43:32, 1.34s/it] {'loss': 0.3603, 'learning_rate': 1.2293340103445409e-05, 'epoch': 0.44}
tensor([[-5.2812, -2.0469, 2.0938, -0.2441, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.0000, -4.4062, -2.5312, 1.3359, -1.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.1562, -4.3125, -0.3105, 0.1709, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6562, -1.3438, 0.9258, -2.7188, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:22:45,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.68 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-2.7812, 1.1641, 3.1875, -2.2188, -3.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0938, -2.7500, 0.5352, -0.2422, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.2188, -1.5703, -0.0947, 3.7344, 0.3145]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3438, -4.1875, -0.3223, 1.7188, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:45,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:22:45,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.59 | bwd_microstep: 2.00 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.95
[2025-11-06 18:22:45,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.29 | bwd: 2.82 | bwd_inner: 1.80 | bwd_allreduce: 0.88 | step: 2.06
44%|████▍ | 1557/3507 [37:59<37:17, 1.15s/it] {'loss': 0.634, 'learning_rate': 1.2284348058229158e-05, 'epoch': 0.44}
tensor([[-3.5000, -3.9219, -1.8750, 2.5469, -1.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:46,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.19 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.5938, -3.9688, -2.0781, 2.1250, -1.4141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.0469, -1.1641, 0.9141, 4.6562, 0.4727]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.2812, -3.4062, 1.4375, 0.2461, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1875, -2.1094, 1.4922, 1.4688, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2812, -2.7656, 0.1406, 3.1719, -1.4609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.1250, -3.2656, 0.5742, 1.2344, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.6875, -2.0312, -0.9570, 2.3125, -0.1318]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:48,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:22:48,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.68 | bwd_microstep: 1.92 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.83 | step_microstep: 1.89
[2025-11-06 18:22:48,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.89 | bwd: 2.85 | bwd_inner: 1.84 | bwd_allreduce: 0.87 | step: 1.97
44%|████▍ | 1558/3507 [38:02<55:27, 1.71s/it] {'loss': 0.2208, 'learning_rate': 1.2275354063847358e-05, 'epoch': 0.44}
tensor([[-5.0938, -2.4844, 2.0938, 1.4141, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.9062, 0.4766, 2.8438, -1.2891, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.8438, -3.1562, -1.3359, 2.6406, -0.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:22:49,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.41 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.4844, -1.2891, 2.5781, 2.4219, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9688, -2.8750, 0.6914, 0.5117, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9844, -3.2188, -0.3887, 1.9844, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2188, -4.9688, -1.8281, 1.7969, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.5625, -4.8438, -0.0601, 1.4688, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:22:49,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:22:49,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.58 | bwd_microstep: 14.38 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 13.06 | step_microstep: 2.26
[2025-11-06 18:22:49,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.02 | bwd: 15.18 | bwd_inner: 1.96 | bwd_allreduce: 13.09 | step: 2.34
44%|████▍ | 1559/3507 [38:03<42:55, 1.32s/it] {'loss': 0.8217, 'learning_rate': 1.2266358127974312e-05, 'epoch': 0.44}
tensor([[-3.7344, 0.3086, 2.9844, -2.2812, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4219, -0.4023, 2.0469, -0.7656, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.0391, -1.4844, 0.3672, 4.7500, 0.6133]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.6562, -5.2812, -0.5781, 1.8203, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3750, -2.0938, 1.2891, 0.4336, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:49,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.03 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.7812, -3.5156, 0.0133, 1.8594, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7812, -3.9219, -0.3066, 2.3594, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.5000, -3.8125, 0.4102, -0.8320, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:51,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.22 | optimizer_step: 0.23
[2025-11-06 18:22:51,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.85 | bwd_microstep: 2.26 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 0.94 | step_microstep: 2.25
[2025-11-06 18:22:51,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 490.89 | bwd: 3.24 | bwd_inner: 2.13 | bwd_allreduce: 0.97 | step: 2.32
44%|████▍ | 1560/3507 [38:05<48:12, 1.49s/it] {'loss': 0.261, 'learning_rate': 1.2257360258285981e-05, 'epoch': 0.44}
tensor([[-3.2812, -3.8438, -1.8906, 2.4219, -1.2109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-5.1250, -3.0781, 0.8125, 0.9180, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1562, -2.4688, 2.0156, 1.1875, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:22:51,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.39 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.6875, -2.4219, 2.1875, -0.1895, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9844, -3.9688, -1.0469, 2.9531, -1.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5938, -3.9219, -0.4766, 2.7188, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.8594, -2.1719, -0.1147, 1.8359, -1.5234]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.6406, 0.9766, 3.9219, 0.0261, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:22:51,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:22:51,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.28 | bwd_microstep: 73.48 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 72.31 | step_microstep: 1.52
[2025-11-06 18:22:51,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.70 | bwd: 74.39 | bwd_inner: 1.93 | bwd_allreduce: 72.34 | step: 1.59
45%|████▍ | 1561/3507 [38:05<38:08, 1.18s/it] {'loss': 0.6419, 'learning_rate': 1.2248360462459979e-05, 'epoch': 0.45}
tensor([[-3.2344, -0.0132, 2.4688, -1.3438, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0312, -4.5000, -1.3516, 1.7578, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9688, -3.7188, 0.6914, 3.2188, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7500, -1.6406, 2.4844, -2.1094, -5.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.2188, -5.5000, -1.2891, 2.4531, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:22:52,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.70 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.4219, -0.5586, 2.8281, 3.3438, -1.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.0625, -3.7031, -1.0625, 1.9375, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.1562, -5.3125, -1.6562, 1.0938, -3.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:22:53,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 |
optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 18:22:53,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.00 | bwd_microstep: 696.73 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 695.87 | step_microstep: 2.54 [2025-11-06 18:22:53,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 384.73 | bwd: 697.67 | bwd_inner: 1.59 | bwd_allreduce: 695.93 | step: 2.62 45%|████▍ | 1562/3507 [38:07<47:30, 1.47s/it] {'loss': 0.1157, 'learning_rate': 1.2239358748175556e-05, 'epoch': 0.45} 45%|████▍ | 1562/3507 [38:07<47:30, 1.47s/it]tensor([[-2.8281, 0.3242, 2.3750, -0.9805, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.9375, 1.4453, 2.3750, -2.1094, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:22:53,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.34 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.2500, -4.4062, -1.8047, 2.3125, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2969, -3.6562, -2.3438, 1.0547, -1.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2812, 1.4688, 3.2812, -1.5312, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4062, -1.7578, 2.3438, -1.1875, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2188, -1.3203, 2.6406, 0.7969, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.8672, 0.2559, 2.8906, 2.0625, -1.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:22:54,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.22 | 
optimizer_step: 0.20 [2025-11-06 18:22:54,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.71 | bwd_microstep: 177.30 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 176.17 | step_microstep: 2.14 [2025-11-06 18:22:54,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.04 | bwd: 178.08 | bwd_inner: 1.67 | bwd_allreduce: 176.22 | step: 2.24 45%|████▍ | 1563/3507 [38:08<38:10, 1.18s/it] {'loss': 0.2895, 'learning_rate': 1.2230355123113612e-05, 'epoch': 0.45} 45%|████▍ | 1563/3507 [38:08<38:10, 1.18s/it]tensor([[-2.7812, -3.0781, -1.7969, 1.4766, -1.0547]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.6172, 2.5312, 2.8281, -1.7812, -1.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.3125, -3.4375, -0.6406, 3.7031, -1.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6875, -3.7969, 0.1729, 0.7461, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7188, -0.5039, 2.4219, -3.2188, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:22:55,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.22 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.9375, -4.9688, -1.9531, 2.4062, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.4844, -3.1094, -0.6523, 2.5469, -1.6016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -3.5000, 0.4531, 2.2031, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:22:57,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.23 | optimizer_step: 0.34 [2025-11-06 
18:22:57,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 121.24 | bwd_microstep: 2182.91 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 2181.77 | step_microstep: 2.40 [2025-11-06 18:22:57,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 293.48 | bwd: 2183.71 | bwd_inner: 1.71 | bwd_allreduce: 2181.84 | step: 2.49 45%|████▍ | 1564/3507 [38:11<57:48, 1.79s/it] {'loss': 0.7699, 'learning_rate': 1.2221349594956664e-05, 'epoch': 0.45} 45%|████▍ | 1564/3507 [38:11<57:48, 1.79s/it]tensor([[-3.5625, -3.3750, -0.6445, 2.9219, -1.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2188, -2.8125, 0.9023, 0.0544, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:22:57,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.39 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.5469, -3.8438, -1.6875, 2.7969, -1.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-7.1250, -4.9688, -0.9883, -0.8281, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9219, 1.0078, 3.3281, -1.4766, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4375, -2.8438, 0.9141, 2.3594, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.8906, 1.6953, 3.1250, -1.6172, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0938, -2.7188, 1.8203, -0.6992, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:22:57,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 18:22:57,926] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 210.55 | bwd_microstep: 1.75 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.81 | step_microstep: 1.76 [2025-11-06 18:22:57,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.96 | bwd: 2.49 | bwd_inner: 1.47 | bwd_allreduce: 0.85 | step: 1.85 45%|████▍ | 1565/3507 [38:11<44:34, 1.38s/it] {'loss': 0.854, 'learning_rate': 1.221234217138886e-05, 'epoch': 0.45} 45%|████▍ | 1565/3507 [38:11<44:34, 1.38s/it]tensor([[-6.3125, -4.6562, 0.3242, 2.2969, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.6328, 1.8984, 1.7578, -1.3828, -1.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.5312, -2.5938, 0.7617, 1.0078, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3438, 0.2617, 2.7969, -1.4297, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0938, 1.6562, 2.9688, -2.2031, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:22:58,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.88 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.9844, -4.0938, -1.9688, 1.6719, -1.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.0938, -3.2500, -1.2812, 2.5938, -1.0859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -2.1719, 1.3516, 0.3008, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:23:00,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 18:23:00,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 143.49 | bwd_microstep: 2210.94 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 2209.68 | step_microstep: 2.58 [2025-11-06 18:23:00,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 282.38 | bwd: 2211.90 | bwd_inner: 2.02 | bwd_allreduce: 2209.72 | step: 2.66 45%|████▍ | 1566/3507 [38:14<1:00:46, 1.88s/it] {'loss': 0.2786, 'learning_rate': 1.2203332860095967e-05, 'epoch': 0.45} 45%|████▍ | 1566/3507 [38:14<1:00:46, 1.88s/it]tensor([[-6.9375, -4.2188, 0.2207, -0.6680, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5000, -1.9922, 2.0625, 1.3594, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:01,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.94 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.0000, -2.7812, 0.4336, 1.8828, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4375, -4.5000, -0.8828, 1.6641, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4688, -0.7852, 2.1875, 0.3340, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-8.7500, -7.2500, -2.0312, 0.4180, -5.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.5078, 1.4062, 1.6797, -2.0000, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.6719, -2.9688, -0.0219, 2.5156, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:23:01,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:23:01,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.28 | 
bwd_microstep: 57.01 | bwd_inner_microstep: 1.53 | bwd_allreduce_microstep: 55.41 | step_microstep: 2.15 [2025-11-06 18:23:01,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.24 | bwd: 57.79 | bwd_inner: 2.23 | bwd_allreduce: 55.44 | step: 2.22 45%|████▍ | 1567/3507 [38:15<46:45, 1.45s/it] {'loss': 0.7919, 'learning_rate': 1.2194321668765357e-05, 'epoch': 0.45} 45%|████▍ | 1567/3507 [38:15<46:45, 1.45s/it][h264 @ 0xb49f380] mmco: unref short failure [h264 @ 0xb49f380] mmco: unref short failure tensor([[-2.1562, 1.1875, 3.5469, -0.0593, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7031, -0.2051, 1.6250, -2.5781, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:01,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.00 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.2500, -3.0781, 1.2656, 1.6719, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8438, -3.0469, 1.2422, 0.0242, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.1250, -0.6523, 2.4844, -1.0078, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3750, -1.1484, 1.7031, 0.9961, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8438, 0.1040, 2.1406, -0.8359, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.3750, 1.3672, 2.5312, -2.6094, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:23:03,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:23:03,590] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 175.28 | bwd_microstep: 1798.54 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1797.46 | step_microstep: 1.85 [2025-11-06 18:23:03,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.31 | bwd: 1799.34 | bwd_inner: 1.70 | bwd_allreduce: 1797.50 | step: 1.92 45%|████▍ | 1568/3507 [38:17<53:48, 1.67s/it] {'loss': 0.3102, 'learning_rate': 1.2185308605086004e-05, 'epoch': 0.45} 45%|████▍ | 1568/3507 [38:17<53:48, 1.67s/it]tensor([[-1.6172, 1.6719, 2.0156, -2.7031, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-8.1250, -7.0000, -2.2969, 0.6172, -5.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:03,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.36 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.9375, -4.6875, -1.3984, 2.4375, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -2.5312, 0.8203, 2.0781, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6875, -2.3906, 0.3984, 1.1016, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.1406, -1.2266, 1.5547, 3.6719, -0.7734]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5000, -4.2812, -1.4375, 2.0312, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5938, -2.2656, 1.1172, 2.6719, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:04,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:23:04,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
201.58 | bwd_microstep: 1.83 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.74 | step_microstep: 1.53 [2025-11-06 18:23:04,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.98 | bwd: 2.84 | bwd_inner: 1.92 | bwd_allreduce: 0.78 | step: 1.62 45%|████▍ | 1569/3507 [38:17<41:53, 1.30s/it] {'loss': 0.3495, 'learning_rate': 1.2176293676748494e-05, 'epoch': 0.45} 45%|████▍ | 1569/3507 [38:17<41:53, 1.30s/it]tensor([[-4.5938, -1.8672, 1.7344, 0.1475, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8125, -2.5312, 1.3203, 0.9297, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:04,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.58 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-1.4062, 2.2188, 3.0156, -2.3438, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -3.2969, -0.0583, 2.3750, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8125, -3.9531, -0.4121, -0.0476, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.5938, -0.5859, 2.5625, 0.2334, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4531, -2.6406, -0.2734, 1.3828, -1.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6562, -4.6875, -0.7773, 1.9531, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:23:06,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.26 | optimizer_step: 0.37 [2025-11-06 18:23:06,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.30 | bwd_microstep: 1870.86 | 
bwd_inner_microstep: 11.80 | bwd_allreduce_microstep: 1858.85 | step_microstep: 3.06 [2025-11-06 18:23:06,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.92 | bwd: 1871.80 | bwd_inner: 12.67 | bwd_allreduce: 1858.89 | step: 3.15 45%|████▍ | 1570/3507 [38:20<51:38, 1.60s/it] {'loss': 0.2768, 'learning_rate': 1.2167276891444986e-05, 'epoch': 0.45} 45%|████▍ | 1570/3507 [38:20<51:38, 1.60s/it]tensor([[-3.2188, -3.5000, -1.5312, 2.3906, -1.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -1.3438, 1.1250, -1.5547, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2500, -3.5312, 0.4238, 1.5781, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:06,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.37 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.3750, 0.8945, 3.2812, -0.1641, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.4531, -0.2500, 2.0469, -1.1172, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.3750, -5.9375, -1.9453, 1.9609, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2031, -1.4453, 1.8281, 2.0312, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5938, -3.3750, 0.1953, 2.3438, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:23:06,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 16.77 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:23:06,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.15 | bwd_microstep: 7.88 | bwd_inner_microstep: 0.77 | 
bwd_allreduce_microstep: 7.03 | step_microstep: 18.32 [2025-11-06 18:23:06,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.56 | bwd: 8.67 | bwd_inner: 1.44 | bwd_allreduce: 7.08 | step: 18.41 45%|████▍ | 1571/3507 [38:20<40:25, 1.25s/it] {'loss': 0.4896, 'learning_rate': 1.2158258256869238e-05, 'epoch': 0.45} 45%|████▍ | 1571/3507 [38:20<40:25, 1.25s/it]tensor([[-6.8125, -5.4688, -0.9453, 1.2656, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0000, -5.3438, -1.4688, 1.7109, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -3.7031, -0.9023, 1.9688, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:06,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.42 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.4062, 1.0078, 2.8438, -1.5859, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.2969, 1.1797, 0.9805, -2.1250, -1.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-6.3750, -3.5469, 1.2969, 0.2832, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3125, -3.6250, 0.5742, 1.9922, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7812, -2.1406, 1.6641, 0.4238, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:23:11,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:23:11,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.90 | bwd_microstep: 3972.70 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 3971.87 | 
step_microstep: 1.90 [2025-11-06 18:23:11,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.35 | bwd: 3973.43 | bwd_inner: 1.37 | bwd_allreduce: 3971.91 | step: 1.98 45%|████▍ | 1572/3507 [38:24<1:10:32, 2.19s/it] {'loss': 0.2383, 'learning_rate': 1.2149237780716575e-05, 'epoch': 0.45} 45%|████▍ | 1572/3507 [38:24<1:10:32, 2.19s/it]tensor([[-4.2500, -2.5469, 0.9453, 1.9141, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:11,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 113.09 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.6094, -0.2773, 2.1719, 1.1562, -1.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1406, -2.5000, -0.2236, 1.8125, -1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5938, -4.9062, -0.7891, 0.4648, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1250, -2.4688, 1.2266, 2.4531, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5000, -3.1875, -0.0251, 1.1250, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.5625, 0.3105, 2.9844, 0.3262, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5625, -2.9375, 0.0138, 2.9219, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:23:11,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:23:11,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.32 | bwd_microstep: 103.87 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 102.81 | step_microstep: 2.19 [2025-11-06 
18:23:11,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.43 | bwd: 104.74 | bwd_inner: 1.77 | bwd_allreduce: 102.84 | step: 2.26 45%|████▍ | 1573/3507 [38:25<53:49, 1.67s/it] {'loss': 0.5376, 'learning_rate': 1.21402154706839e-05, 'epoch': 0.45} 45%|████▍ | 1573/3507 [38:25<53:49, 1.67s/it]tensor([[-5.0000, -2.1562, 1.8047, 0.3418, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.4844, 1.3672, 2.8594, -2.3906, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:11,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.85 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.4688, -2.9219, -1.0781, 3.0000, -0.5664]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-1.1719, 1.8516, 2.8281, -1.2188, -1.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[3.0312, 3.7656, 4.5938, 5.9375, 3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5938, -1.9609, 1.2422, -0.2578, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.6875, -5.5312, -0.2373, 0.8750, -5.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0938, -2.8281, 1.4766, 1.5078, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:23:14,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:23:14,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.49 | bwd_microstep: 2431.40 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 2430.28 | step_microstep: 2.38 [2025-11-06 18:23:14,373] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 298.37 | bwd: 2432.15 | bwd_inner: 1.70 | bwd_allreduce: 2430.32 | step: 2.45 45%|████▍ | 1574/3507 [38:28<1:04:22, 2.00s/it] {'loss': 1.2673, 'learning_rate': 1.213119133446968e-05, 'epoch': 0.45} 45%|████▍ | 1574/3507 [38:28<1:04:22, 2.00s/it]tensor([[-6.0000, -3.4531, 1.2031, 0.7617, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1562, -4.2500, -1.1562, 1.0938, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.1719, -3.4688, -1.5547, 2.3594, -1.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:23:14,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.10 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.3438, -2.5156, 0.8164, 1.0312, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7969, -4.4688, -2.3906, 2.5625, -1.3984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8594, 0.2656, 1.6641, -2.0156, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0000, -3.7031, -0.7969, 2.6250, -1.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.4922, 1.9766, 2.8594, -2.1875, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:14,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:23:14,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 295.92 | bwd_microstep: 1.98 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.90 | step_microstep: 2.16 [2025-11-06 18:23:14,873] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 458.04 | bwd: 2.92 | bwd_inner: 1.86 | bwd_allreduce: 0.93 | step: 2.25 45%|████▍ | 1575/3507 [38:28<49:52, 1.55s/it] {'loss': 0.2763, 'learning_rate': 1.212216537977394e-05, 'epoch': 0.45} 45%|████▍ | 1575/3507 [38:28<49:52, 1.55s/it]tensor([[-4.0000, -2.1719, 1.5312, 2.5312, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:15,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.12 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.8125, -4.6250, -0.9141, 1.1328, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0625, -3.7031, 0.3066, 2.1094, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3906, -2.9531, -1.4297, 2.6562, -0.4727]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4688, -3.6719, -0.2119, 2.7812, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.8438, -4.5000, -0.3359, -0.6250, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2812, -0.6445, 3.0312, -0.5352, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0625, -5.0312, -0.9180, 1.9688, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:23:16,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.29 [2025-11-06 18:23:16,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.85 | bwd_microstep: 1680.92 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1679.70 | step_microstep: 2.14 [2025-11-06 18:23:16,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.97 | bwd: 1681.86 
| bwd_inner: 2.00 | bwd_allreduce: 1679.74 | step: 2.21 45%|████▍ | 1576/3507 [38:30<54:32, 1.69s/it] {'loss': 0.1787, 'learning_rate': 1.2113137614298253e-05, 'epoch': 0.45} 45%|████▍ | 1576/3507 [38:30<54:32, 1.69s/it]tensor([[-5.5312, -2.6094, 2.2344, 1.0234, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.5000, -4.9375, 0.4980, 0.6289, -5.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4062, -5.5000, -1.7734, 0.8750, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.4844, -0.8164, 2.1719, 2.9219, -1.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:17,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.57 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-2.9688, 1.1172, 2.3594, -3.4219, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4688, -2.7969, 0.8281, 1.7188, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2500, -3.9375, -0.5547, 3.1875, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8750, -0.6875, 3.1719, -1.7188, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:23:17,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:23:17,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.56 | bwd_microstep: 34.74 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 33.67 | step_microstep: 1.67 [2025-11-06 18:23:17,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 380.15 | bwd: 35.61 | bwd_inner: 1.76 | bwd_allreduce: 33.71 | 
step: 1.76 45%|████▍ | 1577/3507 [38:31<42:31, 1.32s/it] {'loss': 0.4641, 'learning_rate': 1.2104108045745746e-05, 'epoch': 0.45} 45%|████▍ | 1577/3507 [38:31<42:31, 1.32s/it]tensor([[-5.7812, -2.7656, 0.6836, -1.5938, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1250, -3.2812, 0.5938, 1.5078, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:17,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.35 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.2812, -0.4551, 1.8906, -0.7891, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.5391, 2.1562, 2.8281, -0.5117, -0.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2500, -4.4688, -1.6484, 2.7188, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9062, -3.4219, -2.1562, 1.6016, -1.0078]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4375, -3.1719, 0.1133, 1.7891, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.3438, 0.9414, 3.6875, 0.4824, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:23:19,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.20 | optimizer_step: 0.23 [2025-11-06 18:23:19,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 67.48 | bwd_microstep: 1351.34 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 1350.10 | step_microstep: 2.42 [2025-11-06 18:23:19,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 248.85 | bwd: 1352.04 | bwd_inner: 1.74 | bwd_allreduce: 1350.14 | step: 2.51 45%|████▍ | 1578/3507 
[38:32<45:37, 1.42s/it] {'loss': 0.1925, 'learning_rate': 1.2095076681821068e-05, 'epoch': 0.45} 45%|████▍ | 1578/3507 [38:32<45:37, 1.42s/it]tensor([[-4.8750, -3.6406, -0.5859, 0.6211, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5938, -3.6719, -1.0859, 2.9531, -1.4609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8125, -1.9609, 1.0938, 3.6875, -1.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:19,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 199.43 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-6.1250, -5.9062, -2.4531, 1.6406, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4531, 0.5430, 3.5938, -1.7031, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2031, -1.0391, 0.6953, 1.5234, -1.1797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -2.4688, 0.7422, 1.2734, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.4688, -3.5625, 1.4375, 0.3945, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:19,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:23:19,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.63 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.85 | step_microstep: 1.64 [2025-11-06 18:23:19,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 509.07 | bwd: 3.07 | bwd_inner: 2.07 | bwd_allreduce: 0.88 | step: 1.73 45%|████▌ | 1579/3507 [38:33<37:17, 1.16s/it] {'loss': 0.3707, 
'learning_rate': 1.2086043530230421e-05, 'epoch': 0.45} 45%|████▌ | 1579/3507 [38:33<37:17, 1.16s/it]tensor([[-6.0938, -5.7188, -1.8828, 2.0312, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.4531, 0.5586, 2.4844, -0.3828, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:19,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.54 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-7.7812, -4.4688, 0.4648, -1.6484, -6.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -3.8438, -0.9297, 2.5312, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1250, -2.8125, 0.6797, 2.4219, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.3125, -2.0312, 0.5000, 3.7344, -0.6328]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1250, -4.2188, -0.3242, 0.3691, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3125, -4.5938, -2.1719, 2.2188, -1.8984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:23:22,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:23:22,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.12 | bwd_microstep: 2246.05 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 2244.87 | step_microstep: 2.28 [2025-11-06 18:23:22,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.69 | bwd: 2247.08 | bwd_inner: 2.03 | bwd_allreduce: 2244.90 | step: 2.36 45%|████▌ | 1580/3507 [38:36<51:39, 1.61s/it] {'loss': 0.1334, 'learning_rate': 
1.2077008598681515e-05, 'epoch': 0.45} 45%|████▌ | 1580/3507 [38:36<51:39, 1.61s/it]tensor([[-3.5938, -1.9297, 0.6094, 0.9766, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.9375, -3.4531, 1.7344, -0.7266, -5.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9062, -4.1875, -0.7578, 2.1406, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5625, -3.9219, -0.0986, 0.9023, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:22,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.51 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.0938, -2.8281, 0.3828, 1.5469, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2812, -4.4375, -1.7422, 2.3594, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8594, -1.4375, 1.3125, -0.0201, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8281, -1.0859, 1.9922, -0.0194, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:23:22,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:23:22,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.67 | bwd_microstep: 22.57 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 21.60 | step_microstep: 1.70 [2025-11-06 18:23:22,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.23 | bwd: 23.39 | bwd_inner: 1.59 | bwd_allreduce: 21.64 | step: 1.80 45%|████▌ | 1581/3507 [38:36<40:05, 1.25s/it] {'loss': 0.445, 'learning_rate': 1.206797189488359e-05, 'epoch': 0.45} 
45%|████▌ | 1581/3507 [38:36<40:05, 1.25s/it]tensor([[-4.2500, -3.5312, -0.0405, 3.0312, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:22,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.85 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.2812, -2.9219, 0.4238, 1.5547, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0312, -3.7812, 0.9883, 1.3672, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.1484, 2.2188, 2.7812, -2.0938, -1.8672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5312, -3.1562, 0.0200, 3.5938, -1.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5156, -0.7617, 2.0625, 0.0525, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0625, -5.0625, -1.1328, 1.1406, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3438, -4.1875, -0.4805, 1.5781, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:23:23,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:23:23,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.83 | bwd_microstep: 583.46 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 582.23 | step_microstep: 1.88 [2025-11-06 18:23:23,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.70 | bwd: 584.30 | bwd_inner: 1.85 | bwd_allreduce: 582.28 | step: 1.97 45%|████▌ | 1582/3507 [38:37<37:21, 1.16s/it] {'loss': 0.2741, 'learning_rate': 1.205893342654739e-05, 'epoch': 0.45} 45%|████▌ | 1582/3507 [38:37<37:21, 
1.16s/it]tensor([[-7.5000, -5.8125, -2.0625, -1.4609, -5.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4688, -2.8906, -1.2188, 2.7656, -0.5977]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7031, -3.7031, -0.5078, 4.0625, -1.3203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5000, -3.8594, -0.6445, 2.3281, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:24,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.27 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.4375, -2.8906, 0.5898, 1.5078, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9219, 0.0596, 3.2969, -1.4219, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.4453, 1.5469, 2.0781, -1.2891, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8750, -3.8281, -0.6133, 1.2109, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:23:26,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.20 | optimizer_step: 0.30 [2025-11-06 18:23:26,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.86 | bwd_microstep: 2008.38 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 2007.26 | step_microstep: 2.14 [2025-11-06 18:23:26,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.13 | bwd: 2009.24 | bwd_inner: 1.80 | bwd_allreduce: 2007.30 | step: 2.21 45%|████▌ | 1583/3507 [38:40<53:38, 1.67s/it] {'loss': 0.286, 'learning_rate': 1.204989320138517e-05, 'epoch': 0.45} 45%|████▌ | 1583/3507 [38:40<53:38, 1.67s/it]tensor([[-7.4688, -4.3438, 
0.8828, -0.5547, -5.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:26,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.74 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.7969, -4.1250, -1.8047, 2.4531, -1.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4062, -3.5469, 0.1021, 2.6406, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2500, -2.3125, 1.6250, -0.3574, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7500, -3.4219, 0.2324, 2.0469, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8594, 0.0540, 3.3906, 1.3516, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.5781, 0.6523, 2.6406, -1.4219, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.8984, -2.2812, -0.4473, 3.7500, -0.1001]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:23:27,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:23:27,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.37 | bwd_microstep: 290.77 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 289.71 | step_microstep: 1.58 [2025-11-06 18:23:27,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 266.13 | bwd: 291.77 | bwd_inner: 1.89 | bwd_allreduce: 289.75 | step: 1.67 45%|████▌ | 1584/3507 [38:40<43:10, 1.35s/it] {'loss': 0.3721, 'learning_rate': 1.2040851227110681e-05, 'epoch': 0.45} 45%|████▌ | 1584/3507 [38:40<43:10, 1.35s/it]tensor([[-4.2500e+00, -3.6250e+00, -1.7776e-03, 3.3281e+00, 
-2.1094e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4375, -3.2969, 0.1680, 1.9219, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8125, -0.0233, 2.3125, 0.2285, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.4062, -1.9062, 2.4375, 2.1094, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6875, -1.2734, 2.9688, 0.1709, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4062, -2.5312, 1.9531, 0.2637, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:27,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.90 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.6875, -4.0312, 1.0391, 0.5781, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2500, -4.1875, -1.4922, 2.3281, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:23:32,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:23:32,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 290.14 | bwd_microstep: 4304.98 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 4304.10 | step_microstep: 2.28 [2025-11-06 18:23:32,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 550.07 | bwd: 4305.87 | bwd_inner: 1.61 | bwd_allreduce: 4304.13 | step: 2.38 45%|████▌ | 1585/3507 [38:46<1:20:18, 2.51s/it] {'loss': 0.5128, 'learning_rate': 1.2031807511439176e-05, 'epoch': 0.45} 45%|████▌ | 1585/3507 [38:46<1:20:18, 2.51s/it]tensor([[-4.0938, -1.3438, 2.2188, 0.7383, -3.1719]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.7930, 3.0781, 3.5938, -2.2344, -1.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:32,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.69 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.1250, -2.7969, 0.7930, 2.6562, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -2.3906, 0.0236, -0.4121, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3438, -4.5625, -0.9922, 1.7266, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9844, -0.4902, 1.6172, -2.8438, -3.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.9219, 0.3906, 4.0000, -1.4375, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5469, 0.6328, 3.1094, -2.9531, -3.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:23:32,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:23:32,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.69 | bwd_microstep: 349.67 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 348.76 | step_microstep: 1.70 [2025-11-06 18:23:32,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 285.41 | bwd: 350.52 | bwd_inner: 1.58 | bwd_allreduce: 348.80 | step: 1.77 45%|████▌ | 1586/3507 [38:46<1:02:35, 1.96s/it] {'loss': 0.7169, 'learning_rate': 1.2022762062087372e-05, 'epoch': 0.45} 45%|████▌ | 1586/3507 [38:46<1:02:35, 1.96s/it]tensor([[-2.4219, 0.5469, 2.3906, -0.9492, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:1') tensor([[-4.8750, -1.1797, 2.1094, -2.0312, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0312, -1.2891, 1.9062, 0.2461, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:33,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.08 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.0625, -3.4219, 0.5000, 1.8594, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7188, -3.0000, 1.1953, 0.1592, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5625, -3.5312, 0.0376, 2.5625, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5938, -2.8281, 0.9219, 1.8672, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8438, -3.6562, 0.8398, 1.0547, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:33,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:23:33,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.68 | bwd_microstep: 1.69 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.64 | step_microstep: 1.46 [2025-11-06 18:23:33,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.77 | bwd: 2.66 | bwd_inner: 1.87 | bwd_allreduce: 0.67 | step: 1.53 45%|████▌ | 1587/3507 [38:47<47:29, 1.48s/it] {'loss': 0.2556, 'learning_rate': 1.2013714886773492e-05, 'epoch': 0.45} 45%|████▌ | 1587/3507 [38:47<47:29, 1.48s/it]tensor([[-3.3750, -2.8750, 0.3281, 3.7812, -1.3672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 
18:23:33,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.60 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.5156, -4.1562, -2.1562, 2.5000, -1.2266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5469, -2.5781, 0.3066, 2.3750, -1.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6875, -3.2969, 1.2500, 1.1953, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6094, -0.7969, 2.1562, -0.1377, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-8.0625, -7.3125, -2.9375, 0.4980, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.9062, -4.5625, 0.7422, 1.1562, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6250, -1.3516, 2.7812, -2.2656, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:33,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:23:33,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.78 | bwd_microstep: 1.95 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.86 | step_microstep: 1.99 [2025-11-06 18:23:33,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 487.42 | bwd: 2.87 | bwd_inner: 1.80 | bwd_allreduce: 0.91 | step: 2.09 45%|████▌ | 1588/3507 [38:47<38:21, 1.20s/it] {'loss': 0.2081, 'learning_rate': 1.200466599321721e-05, 'epoch': 0.45} 45%|████▌ | 1588/3507 [38:47<38:21, 1.20s/it]tensor([[-3.2031, -2.2969, -0.0339, 1.5391, -1.8203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:33,968] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 108.69 | bwd_microstep: 1.43 | bwd_inner_microstep: 1.28 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-4.5312, -4.1562, -0.8516, 2.7188, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -5.4375, -2.3281, 2.0781, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0625, -4.5000, -1.2031, 1.7969, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7109, 0.5859, 1.7969, -0.2314, -1.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-3.5312, 0.0767, 2.2812, -2.4844, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4062, -3.7344, -0.5742, 2.3125, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -0.9844, 2.6719, -2.0938, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:23:34,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.21 | optimizer_step: 0.24 [2025-11-06 18:23:34,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.03 | bwd_microstep: 85.06 | bwd_inner_microstep: 9.53 | bwd_allreduce_microstep: 75.41 | step_microstep: 2.13 [2025-11-06 18:23:34,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.74 | bwd: 86.49 | bwd_inner: 10.82 | bwd_allreduce: 75.47 | step: 2.24 45%|████▌ | 1589/3507 [38:48<30:54, 1.03it/s] {'loss': 0.4697, 'learning_rate': 1.1995615389139679e-05, 'epoch': 0.45} 45%|████▌ | 1589/3507 [38:48<30:54, 1.03it/s]tensor([[-2.4375, 1.1250, 3.4531, -0.5703, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0000, -3.0781, 0.6797, 1.2500, -3.4219]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6250, 0.0386, 2.6094, -1.5703, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.1875, -2.7188, 1.5547, -1.5234, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:35,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.22 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-1.4141, 2.0000, 3.0156, -1.3359, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4062, -2.4219, 1.9453, 0.3984, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1250, -3.4844, 0.8672, 0.2402, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1875, -4.9375, -1.5859, 2.2969, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:23:37,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.74 | optimizer_gradients: 0.18 | optimizer_step: 0.23 [2025-11-06 18:23:37,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 210.18 | bwd_microstep: 150.28 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 149.07 | step_microstep: 3.06 [2025-11-06 18:23:37,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 408.41 | bwd: 151.01 | bwd_inner: 1.73 | bwd_allreduce: 149.12 | step: 3.14 45%|████▌ | 1590/3507 [38:51<55:52, 1.75s/it] {'loss': 0.2166, 'learning_rate': 1.1986563082263506e-05, 'epoch': 0.45} 45%|████▌ | 1590/3507 [38:51<55:52, 1.75s/it]tensor([[-6.5312, -2.7500, 1.8984, -1.4609, -5.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -0.5039, 2.3125, -2.0625, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') 
tensor([[-5.9688, -4.5000, -0.6094, 0.9062, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3125, -0.5625, 2.8281, -1.2812, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:38,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 320.17 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.8750, -5.8438, -0.4863, 0.8203, -5.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.7930, 2.2812, 2.8125, -1.2188, -1.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.7969, -0.3223, 2.6719, -1.3203, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0312, -1.8125, 1.0312, 2.4062, -1.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:23:38,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:23:38,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.49 | bwd_microstep: 1.65 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.79 | step_microstep: 1.94 [2025-11-06 18:23:38,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 510.70 | bwd: 2.32 | bwd_inner: 1.35 | bwd_allreduce: 0.83 | step: 2.02 45%|████▌ | 1591/3507 [38:52<44:29, 1.39s/it] {'loss': 0.2467, 'learning_rate': 1.1977509080312755e-05, 'epoch': 0.45} 45%|████▌ | 1591/3507 [38:52<44:29, 1.39s/it]tensor([[-3.9688, -1.4766, 1.8125, 0.4844, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3594, 0.0762, 2.0469, -1.8906, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7500, -4.8438, -1.6875, 2.8125, 
-2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[4.3438, 5.2188, 6.2500, 7.4375, 4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.1875, -3.0312, 2.1250, 0.6484, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:23:39,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.94 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-6.5000, -6.0625, -2.0625, 1.9688, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.4062, -5.2812, -0.9414, -0.7617, -5.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5312, -3.7500, -0.2969, 2.5781, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:23:42,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.21 | optimizer_step: 0.21 [2025-11-06 18:23:42,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.80 | bwd_microstep: 392.96 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 392.03 | step_microstep: 2.79 [2025-11-06 18:23:42,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 498.77 | bwd: 393.97 | bwd_inner: 1.68 | bwd_allreduce: 392.09 | step: 2.90 45%|████▌ | 1592/3507 [38:55<1:07:13, 2.11s/it] {'loss': 0.5578, 'learning_rate': 1.1968453391012928e-05, 'epoch': 0.45} 45%|████▌ | 1592/3507 [38:55<1:07:13, 2.11s/it]tensor([[-3.7969, 0.1592, 3.7656, -0.5039, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6094, -3.3438, -0.5391, 2.9219, -1.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-9.5000, -6.2188, -1.9844, -4.3750, -7.8125]], device='cuda:0', 
[Raw training console output, steps 1593-1613 of 3507 (epoch 0.45-0.46), condensed. The original dump interleaves, for every microstep: per-rank debug prints of a 5-way logit tensor (shape [1, 5], torch.bfloat16, devices cuda:0-cuda:3; the grad_fn=<...> suffix was stripped by whatever captured the log) together with a class-index tensor (apparently the target label), and DeepSpeed Rank-0 timing lines (fwd / bwd / bwd_inner / bwd_allreduce / step microsteps, in ms; bwd_allreduce occasionally spikes to ~3000 ms, which accounts for the step-time variance below). Only the per-step training summaries are kept; the dump cuts off mid-tensor-print after step 1613.]

step   loss    learning_rate            epoch  step time
1593   0.7259  1.1959396022090984e-05   0.45   1.65 s/it
1594   0.5781  1.1950336981275287e-05   0.45   1.98 s/it
1595   0.4387  1.1941276276295659e-05   0.45   1.61 s/it
1596   0.7901  1.1932213914883322e-05   0.46   1.95 s/it
1597   0.2715  1.1923149904770914e-05   0.46   1.50 s/it
1598   0.5352  1.1914084253692486e-05   0.46   1.83 s/it
1599   0.3969  1.1905016969383484e-05   0.46   1.85 s/it
1600   0.2204  1.189594805958075e-05    0.46   1.43 s/it
1601   0.6501  1.1886877532022512e-05   0.46   1.49 s/it
1602   0.086   1.1877805394448378e-05   0.46   2.06 s/it
1603   0.8418  1.1868731654599332e-05   0.46   2.48 s/it
1604   0.8093  1.1859656320217723e-05   0.46   1.89 s/it
1605   0.5939  1.185057939904726e-05    0.46   1.72 s/it
1606   0.6849  1.1841500898833005e-05   0.46   1.37 s/it
1607   1.5583  1.1832420827321374e-05   0.46   1.49 s/it
1608   0.1077  1.1823339192260117e-05   0.46   1.83 s/it
1609   0.6313  1.1814256001398319e-05   0.46   1.44 s/it
1610   0.1588  1.1805171262486397e-05   0.46   1.51 s/it
1611   0.7145  1.1796084983276084e-05   0.46   1.17 s/it
1612   0.1958  1.1786997171520429e-05   0.46   1.21 s/it
1613   0.3471  1.1777907834973787e-05   0.46   1.48 s/it
-1.2969, -3.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6250, -1.7656, 1.7812, -2.3281, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:24:16,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.19 | optimizer_step: 0.22 [2025-11-06 18:24:16,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.00 | bwd_microstep: 156.82 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 155.75 | step_microstep: 1.92 [2025-11-06 18:24:16,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.92 | bwd: 157.93 | bwd_inner: 1.95 | bwd_allreduce: 155.82 | step: 2.04 46%|████▌ | 1614/3507 [39:30<37:40, 1.19s/it] {'loss': 0.1769, 'learning_rate': 1.176881698139182e-05, 'epoch': 0.46} 46%|████▌ | 1614/3507 [39:30<37:40, 1.19s/it]tensor([[-5.1562, -4.4375, -0.8320, 2.2031, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:16,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.30 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.3984, 2.1562, 2.9219, -2.3906, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-3.7656, -0.2344, 2.2500, -1.7109, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3281, -0.5391, 1.9062, -0.9922, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3438, -5.2188, -2.1406, 1.8984, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7188, -3.2500, 0.3965, 1.8906, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1875, -1.5078, 2.1562, 0.8633, -3.2344]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5938, -2.1562, 1.7812, 1.2969, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:24:18,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.23 [2025-11-06 18:24:18,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.97 | bwd_microstep: 1596.78 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1595.66 | step_microstep: 2.16 [2025-11-06 18:24:18,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.30 | bwd: 1597.82 | bwd_inner: 1.98 | bwd_allreduce: 1595.70 | step: 2.24 46%|████▌ | 1615/3507 [39:32<49:46, 1.58s/it] {'loss': 0.3692, 'learning_rate': 1.1759724618531475e-05, 'epoch': 0.46} 46%|████▌ | 1615/3507 [39:32<49:46, 1.58s/it]tensor([[-2.9531, -3.1875, -1.1953, 2.9531, -0.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-5.7188, -5.4375, -2.1562, 1.7266, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:18,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.80 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.8750, -0.3750, 2.6250, -1.0078, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3125, -1.7734, 2.1250, -1.2891, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1250, -4.3125, -0.4512, 2.8594, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.6562, -5.7500, -2.0625, 0.6953, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8125, -4.3438, -2.2031, 0.3066, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') tensor([[-5.5000, -3.2344, 1.2891, 1.6172, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:24:19,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:24:19,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.12 | bwd_microstep: 108.01 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 106.92 | step_microstep: 1.78 [2025-11-06 18:24:19,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 285.92 | bwd: 108.82 | bwd_inner: 1.67 | bwd_allreduce: 106.97 | step: 1.87 46%|████▌ | 1616/3507 [39:32<38:58, 1.24s/it] {'loss': 0.6501, 'learning_rate': 1.1750630754150995e-05, 'epoch': 0.46} 46%|████▌ | 1616/3507 [39:32<38:58, 1.24s/it]tensor([[-4.2500, -1.5781, 1.1406, -0.9609, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4062, -3.9688, 0.2539, 2.1406, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:19,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.60 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.8594, 0.9609, 3.0781, -2.2500, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.8672, 2.0625, 3.6250, -2.2969, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.6641, 1.6016, 2.5625, -1.7031, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.2500, -5.1875, -1.5859, 2.9375, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -3.3750, 0.5508, 2.1875, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-4.2500, -2.2656, 1.1250, 1.1484, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:24:21,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.77 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:24:21,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 212.04 | bwd_microstep: 1679.17 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 1678.14 | step_microstep: 2.63 [2025-11-06 18:24:21,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 409.67 | bwd: 1679.90 | bwd_inner: 1.57 | bwd_allreduce: 1678.19 | step: 2.71 46%|████▌ | 1617/3507 [39:35<47:26, 1.51s/it] {'loss': 0.3588, 'learning_rate': 1.17415353960099e-05, 'epoch': 0.46} 46%|████▌ | 1617/3507 [39:35<47:26, 1.51s/it]tensor([[-2.7500, 1.2031, 3.4219, -1.5547, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:21,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.12 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.9531, -3.9844, -1.0703, 3.0625, -1.7109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4375, -4.3438, -1.0078, 3.4531, -1.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.7188, -3.8125, 1.2188, 0.2002, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1562, -3.0781, 1.3203, 1.9609, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.7188, 1.8828, 2.9062, -2.1094, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6250, -3.6875, 0.1797, 3.0938, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4688, -4.9375, -1.5703, 
1.7656, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:24:21,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:24:21,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.87 | bwd_microstep: 82.15 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 81.29 | step_microstep: 1.93 [2025-11-06 18:24:21,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 275.02 | bwd: 83.22 | bwd_inner: 1.74 | bwd_allreduce: 81.33 | step: 2.01 46%|████▌ | 1618/3507 [39:35<36:55, 1.17s/it] {'loss': 0.5295, 'learning_rate': 1.1732438551868987e-05, 'epoch': 0.46} 46%|████▌ | 1618/3507 [39:35<36:55, 1.17s/it]tensor([[-3.2812, 0.5820, 3.2344, -1.3359, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:21,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.14 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.8750, -2.7656, 0.7109, 0.3789, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2656, -2.7656, -0.8672, 3.7188, -0.1807]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.5000, -5.1250, 0.2451, 0.8945, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0938, -4.3750, 0.3613, 2.0625, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2812, -2.5000, 0.5234, 2.7812, -1.6797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8438, -4.0000, -1.5781, 2.4219, -1.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2812, -3.3750, 0.1904, 2.9688, -2.2812]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:24:24,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:24:24,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.90 | bwd_microstep: 2623.96 | bwd_inner_microstep: 6.08 | bwd_allreduce_microstep: 2617.78 | step_microstep: 2.26 [2025-11-06 18:24:24,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.07 | bwd: 2624.65 | bwd_inner: 6.67 | bwd_allreduce: 2617.83 | step: 2.34 46%|████▌ | 1619/3507 [39:38<54:06, 1.72s/it] {'loss': 0.3094, 'learning_rate': 1.172334022949032e-05, 'epoch': 0.46} 46%|████▌ | 1619/3507 [39:38<54:06, 1.72s/it]tensor([[-2.8906, -3.4688, -2.1094, 1.8672, -0.9180]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.8672, 0.5703, 2.0156, 0.0557, -1.6953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6719, -1.6562, 1.8750, 2.4531, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.2969, 1.8203, 2.3906, -1.6328, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:24:24,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.70 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.4375, -2.7969, -1.1328, 2.9375, -0.4961]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-5.4375, -4.5312, -0.6523, 2.4062, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0625, -4.3750, 0.1934, 1.7031, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.4062, -2.5938, 1.8594, -1.9688, -5.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:2') [2025-11-06 18:24:25,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.14 | optimizer_step: 0.24 [2025-11-06 18:24:25,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 284.05 | bwd_microstep: 22.68 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 21.56 | step_microstep: 1.71 [2025-11-06 18:24:25,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 474.78 | bwd: 23.45 | bwd_inner: 1.72 | bwd_allreduce: 21.60 | step: 1.78 46%|████▌ | 1620/3507 [39:39<42:58, 1.37s/it] {'loss': 1.0336, 'learning_rate': 1.1714240436637224e-05, 'epoch': 0.46} 46%|████▌ | 1620/3507 [39:39<42:58, 1.37s/it]tensor([[-5.6562, -4.6875, -0.9688, 1.8672, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3438, -2.8125, 1.0938, 2.7188, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -0.2715, 3.1719, -2.0938, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1562, -0.9805, 2.1250, 1.6250, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9062, -1.7812, 2.4531, 0.1562, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:25,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.76 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.3906, 0.9727, 2.4688, -1.8750, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.5625, -3.2344, 1.4375, 1.2578, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0000, -3.7188, -1.7969, 3.0625, -0.7539]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 
18:24:26,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:24:26,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.86 | bwd_microstep: 735.11 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 734.00 | step_microstep: 71.58 [2025-11-06 18:24:26,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.58 | bwd: 736.14 | bwd_inner: 1.96 | bwd_allreduce: 734.03 | step: 71.66 46%|████▌ | 1621/3507 [39:40<45:13, 1.44s/it] {'loss': 0.2226, 'learning_rate': 1.1705139181074276e-05, 'epoch': 0.46} 46%|████▌ | 1621/3507 [39:40<45:13, 1.44s/it]tensor([[-3.2031, -2.1406, 0.9922, 2.7969, -1.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:26,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 93.03 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-7.1875, -5.7812, -1.8281, -0.2793, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2969, -3.1094, -0.9258, 2.1719, -1.4922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1875, -3.9062, -0.1167, 1.6172, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([[-4.2812, -2.2656, 1.2266, 1.3984, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([3], device='cuda:1') tensor([[-5.5312, -4.2500, -0.4824, 1.3594, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4375, -4.1250, 0.4023, 2.8125, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3125, -3.2188, 1.0547, 1.4766, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:24:27,663] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.21 [2025-11-06 18:24:27,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.66 | bwd_microstep: 633.66 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 632.62 | step_microstep: 1.97 [2025-11-06 18:24:27,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 216.71 | bwd: 634.57 | bwd_inner: 1.77 | bwd_allreduce: 632.66 | step: 2.06 46%|████▋ | 1622/3507 [39:41<39:54, 1.27s/it] {'loss': 0.5309, 'learning_rate': 1.1696036470567309e-05, 'epoch': 0.46} 46%|████▋ | 1622/3507 [39:41<39:54, 1.27s/it]tensor([[-3.0156, 0.5938, 1.9922, -2.9375, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-2.5938, -2.7500, -1.1953, 1.9062, -0.9648]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7500, -0.8203, 2.2812, -0.2393, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2500, -1.3359, 3.0312, -0.8672, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6875, -3.7656, 0.9375, 1.9297, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2344, 0.2178, 2.9375, -0.7070, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0312, -0.1367, 3.4531, -1.1484, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:29,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.27 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.6250, -3.7188, -1.5469, 2.1875, -1.5703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:29,721] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 1.07 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:24:29,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.67 | bwd_microstep: 1.95 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.87 | step_microstep: 3.08 [2025-11-06 18:24:29,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.97 | bwd: 2.69 | bwd_inner: 1.65 | bwd_allreduce: 0.91 | step: 3.17 46%|████▋ | 1623/3507 [39:43<47:18, 1.51s/it] {'loss': 0.4248, 'learning_rate': 1.1686932312883385e-05, 'epoch': 0.46} 46%|████▋ | 1623/3507 [39:43<47:18, 1.51s/it]tensor([[-2.0938, 0.4746, 1.4375, -1.2109, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.1562, -4.2500, -1.2891, 2.9062, -1.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:29,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.66 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.1562, -2.6562, 0.9453, 2.5625, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -3.8125, 0.2490, 2.6406, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -3.9688, -0.2412, 2.6094, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.7500, -5.0312, 0.6562, 0.4473, -5.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.2500, -5.4062, -0.3750, 1.0625, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5000, -1.9688, 2.8438, 0.0449, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:24:30,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | 
optimizer_gradients: 0.25 | optimizer_step: 0.23 [2025-11-06 18:24:30,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.57 | bwd_microstep: 58.99 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 57.77 | step_microstep: 2.20 [2025-11-06 18:24:30,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 467.26 | bwd: 59.67 | bwd_inner: 1.72 | bwd_allreduce: 57.80 | step: 2.28 46%|████▋ | 1624/3507 [39:44<38:30, 1.23s/it] {'loss': 0.3545, 'learning_rate': 1.1677826715790816e-05, 'epoch': 0.46} 46%|████▋ | 1624/3507 [39:44<38:30, 1.23s/it]tensor([[-4.9375, -5.0625, -1.8906, 2.8750, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5000, -1.4688, 1.2266, 1.1484, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4688, -2.4531, 1.3750, 1.7188, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3438, -0.0796, 2.1719, -1.3359, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2812, -3.6719, -0.4141, 2.7656, -2.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4062, -4.5000, -0.7969, 1.8750, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -2.5938, 1.7969, 0.9570, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:32,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 283.00 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.5000, -2.9688, 0.9648, 2.6250, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:32,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.17 | optimizer_gradients: 0.20 | optimizer_step: 
0.19 [2025-11-06 18:24:32,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.50 | bwd_microstep: 2.01 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.89 | step_microstep: 4.05 [2025-11-06 18:24:32,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 468.53 | bwd: 2.79 | bwd_inner: 1.70 | bwd_allreduce: 0.94 | step: 4.14 46%|████▋ | 1625/3507 [39:46<49:30, 1.58s/it] {'loss': 0.7086, 'learning_rate': 1.166871968705913e-05, 'epoch': 0.46} 46%|████▋ | 1625/3507 [39:46<49:30, 1.58s/it]tensor([[-4.7812, -3.7812, -0.4004, 1.7812, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-8.0625, -6.3438, -1.0859, 0.7305, -5.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:33,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.90 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.2812, -4.3125, 0.6562, 1.7109, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6875, -3.1719, 2.0000, -0.6484, -5.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.8984, 1.8203, 3.7969, -1.0547, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9062, -1.4531, 2.2031, -1.2031, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.9688, -2.0156, 2.1250, 0.2812, -3.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1250, -2.9219, 0.9062, -1.7344, -5.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:24:33,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:24:33,327] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.99 | bwd_microstep: 156.38 | bwd_inner_microstep: 1.30 | bwd_allreduce_microstep: 155.00 | step_microstep: 2.06 [2025-11-06 18:24:33,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 429.92 | bwd: 157.22 | bwd_inner: 2.03 | bwd_allreduce: 155.04 | step: 2.15 46%|████▋ | 1626/3507 [39:47<40:35, 1.29s/it] {'loss': 0.1383, 'learning_rate': 1.165961123445908e-05, 'epoch': 0.46} 46%|████▋ | 1626/3507 [39:47<40:35, 1.29s/it]tensor([[-5.4688, -4.3750, -0.3477, 2.2500, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7344, -3.6094, -0.2910, 4.3125, -1.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3438, -1.2734, 2.9375, -1.4219, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, -1.6172, 2.3750, 0.8359, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.7500, -3.7344, 1.3359, 0.0835, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5312, -6.0312, -2.4219, 1.0156, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6875, -5.7812, -2.5625, 2.0625, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:35,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.26 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.9688, -2.7656, 1.4609, 1.7422, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:36,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:24:36,193] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 190.43 | bwd_microstep: 1.70 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 1.77 [2025-11-06 18:24:36,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.71 | bwd: 2.65 | bwd_inner: 1.76 | bwd_allreduce: 0.77 | step: 1.85 46%|████▋ | 1627/3507 [39:50<55:20, 1.77s/it] {'loss': 0.1877, 'learning_rate': 1.1650501365762639e-05, 'epoch': 0.46} 46%|████▋ | 1627/3507 [39:50<55:20, 1.77s/it]tensor([[-6.2188, -4.4688, -0.6523, 0.0938, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:36,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.32 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.3906, -3.7969, -1.7734, 2.3750, -1.2578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9531, -3.2031, 0.0376, 3.1250, -1.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2188, -5.6875, -1.7578, 2.0469, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -3.0312, 0.1621, 0.7969, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9062, -2.8125, 1.2422, 3.7656, -1.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7500, -3.9531, -0.2471, 2.9219, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3281, 0.9102, 3.8438, -1.9531, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:24:36,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:24:36,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.63 | bwd_microstep: 
104.23 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 103.17 | step_microstep: 1.86 [2025-11-06 18:24:36,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.98 | bwd: 105.14 | bwd_inner: 1.79 | bwd_allreduce: 103.21 | step: 1.95 46%|████▋ | 1628/3507 [39:50<43:14, 1.38s/it] {'loss': 0.3181, 'learning_rate': 1.164139008874298e-05, 'epoch': 0.46} 46%|████▋ | 1628/3507 [39:50<43:14, 1.38s/it]tensor([[-3.0000, 0.2715, 2.5000, -1.1406, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.0938, -4.7812, -0.5820, -0.7266, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5312, -3.8125, -1.0156, 3.7031, -1.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.2188, -3.7344, 1.2891, 1.4141, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, -3.4531, 0.0356, 3.2969, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -2.0000, 1.5391, -0.6406, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:37,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 206.96 | bwd_microstep: 1.32 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.9688, -3.3594, -0.0159, 0.5625, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.7812, -1.0000, 2.7656, -1.4297, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:24:37,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:24:37,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.35 | bwd_microstep: 35.45 | bwd_inner_microstep: 1.76 | 
bwd_allreduce_microstep: 33.60 | step_microstep: 1.82 [2025-11-06 18:24:37,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.34 | bwd: 36.77 | bwd_inner: 2.98 | bwd_allreduce: 33.65 | step: 1.91 46%|████▋ | 1629/3507 [39:51<42:29, 1.36s/it] {'loss': 0.352, 'learning_rate': 1.1632277411174484e-05, 'epoch': 0.46} 46%|████▋ | 1629/3507 [39:51<42:29, 1.36s/it]tensor([[-3.1719, 1.0469, 3.4375, -2.3906, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2500, -1.7578, 1.3203, 2.5156, -1.8516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:38,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.92 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.5938, -3.9219, -1.8672, 2.2031, -1.4609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8984, 0.7422, 1.3438, -1.7812, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -2.3594, 1.4531, 1.9531, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0625, -4.1250, 0.8125, 2.0625, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5312, 0.9102, 4.0625, -1.9141, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4062, -3.4062, 0.5820, 0.9531, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:38,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:24:38,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.36 | bwd_microstep: 1.76 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.78 | step_microstep: 
1.96 [2025-11-06 18:24:38,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.29 | bwd: 2.63 | bwd_inner: 1.68 | bwd_allreduce: 0.82 | step: 2.05 46%|████▋ | 1630/3507 [39:52<39:10, 1.25s/it] {'loss': 0.4789, 'learning_rate': 1.1623163340832725e-05, 'epoch': 0.46} 46%|████▋ | 1630/3507 [39:52<39:10, 1.25s/it]tensor([[-4.4688, -2.0625, 1.6719, 0.9688, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -1.9688, 1.6406, 0.7969, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.9062, -4.9688, -0.3086, 0.6797, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4062, -0.5703, 2.7344, 0.8516, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2812, -4.9062, -0.5938, 1.4688, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:39,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.28 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.3750, -4.9375, -0.0226, 2.4844, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0000, -3.6094, -0.0219, 1.8438, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -2.9531, 0.9023, 3.7031, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:39,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:24:39,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.64 | bwd_microstep: 1.73 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 1.41 [2025-11-06 18:24:39,541] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 479.94 | bwd: 2.67 | bwd_inner: 1.75 | bwd_allreduce: 0.78 | step: 1.49 47%|████▋ | 1631/3507 [39:53<32:37, 1.04s/it] {'loss': 0.6181, 'learning_rate': 1.1614047885494463e-05, 'epoch': 0.47} 47%|████▋ | 1631/3507 [39:53<32:37, 1.04s/it]tensor([[-4.4375, -1.8438, 1.8750, 0.8828, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8125, -5.0625, -0.5898, 3.0781, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:39,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.34 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.1875, -6.4688, -3.3750, 1.5938, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -3.5000, 0.8438, 2.4844, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5000, -2.7500, 1.1172, -0.6406, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1875, -2.1250, 1.7031, -0.2402, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3125, -1.9922, 0.9805, -2.1094, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9375, -3.9375, 0.4180, 0.9531, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:24:41,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.28 [2025-11-06 18:24:41,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.40 | bwd_microstep: 895.76 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 894.67 | step_microstep: 2.16 [2025-11-06 18:24:41,542] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 286.76 | bwd: 896.82 | bwd_inner: 1.95 | bwd_allreduce: 894.72 | step: 2.24 47%|████▋ | 1632/3507 [39:55<41:35, 1.33s/it] {'loss': 0.4257, 'learning_rate': 1.160493105293765e-05, 'epoch': 0.47} 47%|████▋ | 1632/3507 [39:55<41:35, 1.33s/it]tensor([[-3.3906, -1.9688, 0.5820, 1.6719, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2188, -2.2188, 2.0781, 0.2734, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6250, -3.6719, -0.6914, 3.6094, -1.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5000, -2.9219, 0.9453, -0.2852, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.0156, 0.8359, 3.6875, 1.6953, -1.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.1406, -0.9766, 2.1094, 3.9062, -0.8242]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:42,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.70 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 tensor([[-6.0938, -4.9688, -1.1094, 1.1562, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8750, -3.2031, 0.4961, 1.2188, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:24:42,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:24:42,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.76 | bwd_microstep: 80.84 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 79.50 | step_microstep: 1.73 [2025-11-06 18:24:42,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.47 | bwd: 81.96 | bwd_inner: 
2.23 | bwd_allreduce: 79.56 | step: 1.84 47%|████▋ | 1633/3507 [39:56<42:11, 1.35s/it] {'loss': 0.5699, 'learning_rate': 1.1595812850941392e-05, 'epoch': 0.47} 47%|████▋ | 1633/3507 [39:56<42:11, 1.35s/it]tensor([[-6.4375, -3.1719, 1.3984, -0.7617, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:43,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.53 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.0312, -4.3438, 0.2031, 1.6641, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5938, -3.1875, -0.2275, 3.0312, -1.6016]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5938, -3.8281, 1.0625, 0.2656, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4688, -0.8359, 2.5312, -1.1953, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.3750, -3.2188, -1.9219, 2.8438, -0.2578]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -2.0781, 1.6875, 1.4297, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.0625, 0.2354, 3.0312, -0.2441, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:24:44,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.85 | optimizer_gradients: 0.19 | optimizer_step: 0.22 [2025-11-06 18:24:44,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.37 | bwd_microstep: 873.20 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 872.03 | step_microstep: 2.87 [2025-11-06 18:24:44,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 280.87 | bwd: 874.00 | bwd_inner: 1.75 | bwd_allreduce: 872.08 | step: 
2.96 47%|████▋ | 1634/3507 [39:58<48:08, 1.54s/it] {'loss': 0.3784, 'learning_rate': 1.1586693287285989e-05, 'epoch': 0.47} 47%|████▋ | 1634/3507 [39:58<48:08, 1.54s/it]tensor([[-4.7812, -3.5938, -0.1235, 1.6094, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4062, -3.2500, -0.2354, 3.6719, -1.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3750, -2.6250, 0.7773, 3.5156, -1.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1562, -3.6406, -0.2168, 3.3125, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8906, -0.5469, 2.0312, -1.6406, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4844, -3.0000, -0.4941, 2.2812, -1.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:24:45,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.85 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-4.1875, -4.0625, -1.0703, 2.8125, -1.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6875, -3.3750, -0.4121, 2.8750, -1.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:24:46,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:24:46,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.16 | bwd_microstep: 948.48 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 947.18 | step_microstep: 1.79 [2025-11-06 18:24:46,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.03 | bwd: 949.45 | bwd_inner: 2.03 | bwd_allreduce: 947.24 | step: 1.91 47%|████▋ | 1635/3507 
[40:00<47:51, 1.53s/it] {'loss': 0.0726, 'learning_rate': 1.1577572369752886e-05, 'epoch': 0.47} 47%|████▋ | 1635/3507 [40:00<47:51, 1.53s/it]tensor([[-4.9062, -4.7500, -1.3438, 2.7031, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -3.5625, -0.4570, 2.9375, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:46,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.71 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11 tensor([[-5.2812, -3.5781, -0.0557, 0.6250, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5781, -0.9609, 2.3750, 1.0312, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7656, -4.4375, -2.9688, 1.3984, -1.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5312, -4.1875, -0.7695, 3.0156, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6875, -2.7031, 1.0547, 1.2109, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8438, -1.6484, 1.4688, 0.7148, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:24:48,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:24:48,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.18 | bwd_microstep: 1504.97 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1503.81 | step_microstep: 2.53 [2025-11-06 18:24:48,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 295.90 | bwd: 1505.83 | bwd_inner: 1.78 | bwd_allreduce: 1503.87 | step: 2.63 47%|████▋ | 1636/3507 [40:02<50:38, 1.62s/it] {'loss': 
0.4396, 'learning_rate': 1.1568450106124684e-05, 'epoch': 0.47} 47%|████▋ | 1636/3507 [40:02<50:38, 1.62s/it]tensor([[-1.3906, 1.4297, 2.7969, -0.3652, -1.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:48,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 57.93 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-8.3125, -5.7188, -0.0713, 0.1807, -6.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7656, -3.9688, -1.0625, 3.6094, -1.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0156, 0.4863, 2.4688, -1.6094, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1562, -3.0312, -0.7344, 2.7344, -1.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6875, -2.0938, 2.5781, 1.9141, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1875, -0.8008, 3.2188, 0.3594, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2031, -2.3906, -0.5234, 3.2656, -0.3633]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') [2025-11-06 18:24:49,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:24:49,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.97 | bwd_microstep: 577.73 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 576.81 | step_microstep: 2.02 [2025-11-06 18:24:49,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 235.91 | bwd: 578.69 | bwd_inner: 1.70 | bwd_allreduce: 576.86 | step: 2.10 47%|████▋ | 1637/3507 [40:02<43:19, 1.39s/it] {'loss': 0.6584, 'learning_rate': 
1.155932650418514e-05, 'epoch': 0.47} 47%|████▋ | 1637/3507 [40:02<43:19, 1.39s/it]tensor([[-4.1875, -1.6250, 3.1719, 2.7500, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:49,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.21 | bwd_microstep: 1.20 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.8750, -5.1875, -2.3281, 2.3906, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3438, -0.2119, 1.8984, -1.4609, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2812, -5.5312, -0.9609, 2.8438, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4375, -4.0000, 0.4805, 2.6875, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0625, -1.3984, 2.6562, -0.8320, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.0000, -3.2500, 1.3750, 0.1855, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -3.0000, 0.2617, 2.2344, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:24:49,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:24:49,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.91 | bwd_microstep: 403.81 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 402.46 | step_microstep: 1.53 [2025-11-06 18:24:49,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.16 | bwd: 405.00 | bwd_inner: 2.31 | bwd_allreduce: 402.51 | step: 1.62 47%|████▋ | 1638/3507 [40:03<37:39, 1.21s/it] {'loss': 0.3099, 'learning_rate': 1.1550201571719153e-05, 'epoch': 0.47} 
47%|████▋ | 1638/3507 [40:03<37:39, 1.21s/it]tensor([[-5.1875, -5.4062, -2.5469, 1.8828, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2812, -3.2969, 0.4160, 3.0312, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6719, -1.8594, 0.7148, 0.2969, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1250, -4.0000, -0.1914, 2.3438, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5312, -1.8125, 1.7734, 2.8594, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.6680, 1.2969, 1.9531, 0.4492, -0.6523]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:24:51,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.19 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.7344, -2.5156, 0.8750, 2.7031, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.3594, -3.6719, -2.2188, 0.9688, -1.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:24:52,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:24:52,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.70 | bwd_microstep: 718.67 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 717.77 | step_microstep: 2.22 [2025-11-06 18:24:52,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.92 | bwd: 719.33 | bwd_inner: 1.40 | bwd_allreduce: 717.81 | step: 2.30 47%|████▋ | 1639/3507 [40:06<52:48, 1.70s/it] {'loss': 0.4678, 'learning_rate': 1.1541075316512746e-05, 'epoch': 0.47} 47%|████▋ | 1639/3507 [40:06<52:48, 
1.70s/it]tensor([[-1.6094, -0.5352, 2.1094, 3.5469, -0.4902]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:52,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.87 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.6250, -3.5156, -0.0840, 2.0000, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.8320, 1.6406, 4.1875, 2.4062, -0.6680]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7188, -3.3438, 1.5469, 1.6172, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0312, -0.0228, 3.2500, -1.8906, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5781, 0.3906, 3.2500, -1.7812, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2188, -3.3594, 0.9375, 1.7969, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4688, -2.0625, 1.6953, 1.0078, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:24:53,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.17 | optimizer_step: 0.26 [2025-11-06 18:24:53,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.82 | bwd_microstep: 180.70 | bwd_inner_microstep: 1.32 | bwd_allreduce_microstep: 179.27 | step_microstep: 2.21 [2025-11-06 18:24:53,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.72 | bwd: 181.63 | bwd_inner: 2.15 | bwd_allreduce: 179.32 | step: 2.30 47%|████▋ | 1640/3507 [40:07<41:44, 1.34s/it] {'loss': 0.6482, 'learning_rate': 1.1531947746353087e-05, 'epoch': 0.47} 47%|████▋ | 1640/3507 [40:07<41:44, 1.34s/it]tensor([[-3.8906, -3.0938, 
0.4023, 3.2188, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7188, -2.4062, 2.3438, -0.2227, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8750, -4.1250, -1.4844, 2.8750, -1.5234]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.1562, -6.3438, -1.5703, 2.1875, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:53,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 314.15 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-6.4688, -3.1875, 1.0703, -1.5078, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6250, -1.6016, 2.0469, -0.2637, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0312, -5.3750, -0.1660, 1.8281, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3750, -3.0938, 0.2148, 1.7344, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:24:55,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 18:24:55,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 212.64 | bwd_microstep: 1814.88 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 1813.67 | step_microstep: 2.00 [2025-11-06 18:24:55,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 526.83 | bwd: 1815.88 | bwd_inner: 1.95 | bwd_allreduce: 1813.73 | step: 2.11 47%|████▋ | 1641/3507 [40:09<54:43, 1.76s/it] {'loss': 0.6951, 'learning_rate': 1.1522818869028447e-05, 'epoch': 0.47} 47%|████▋ | 1641/3507 [40:09<54:43, 1.76s/it]tensor([[-1.4688, 1.6094, 2.8594, -0.9844, -1.8594]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6406, -0.4102, 2.9531, 0.0396, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0000, -2.4062, 1.1406, -0.0311, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4062, -1.7969, 1.4531, -0.0510, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:24:56,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.67 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.15 tensor([[-4.1875, -0.1079, 2.6094, -2.6875, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.6133, 2.9844, 4.3438, -0.5078, -1.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3750, -2.6719, 1.1406, 1.8984, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8438, -1.2188, 3.2188, -0.3027, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:24:56,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:24:56,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.98 | bwd_microstep: 239.76 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 238.55 | step_microstep: 1.68 [2025-11-06 18:24:56,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.68 | bwd: 240.80 | bwd_inner: 2.03 | bwd_allreduce: 238.61 | step: 1.84 47%|████▋ | 1642/3507 [40:10<44:12, 1.42s/it] {'loss': 0.396, 'learning_rate': 1.151368869232823e-05, 'epoch': 0.47} 47%|████▋ | 1642/3507 [40:10<44:12, 1.42s/it]tensor([[-7.2500, -5.7188, -0.9961, 1.2109, -4.7812]], device='cuda:0', dtype=torch.bfloat16,
grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:56,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.94 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.5938, -3.4531, -0.0928, 1.6250, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0000, -3.6250, 0.9570, 0.8711, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2812, -2.3125, 2.1094, 0.3516, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.9688, -3.4531, 1.5234, 1.1719, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7188, -4.3438, -1.2734, 1.8750, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.2500, -5.2500, -1.1719, 1.5156, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5000, -2.8750, 0.6719, 1.5938, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:24:59,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:24:59,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.44 | bwd_microstep: 2303.02 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 2301.82 | step_microstep: 2.21 [2025-11-06 18:24:59,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 304.40 | bwd: 2304.05 | bwd_inner: 2.04 | bwd_allreduce: 2301.87 | step: 2.30 47%|████▋ | 1643/3507 [40:13<55:34, 1.79s/it] {'loss': 0.8122, 'learning_rate': 1.1504557224042943e-05, 'epoch': 0.47} 47%|████▋ | 1643/3507 [40:13<55:34, 1.79s/it]tensor([[-2.9688, 0.0942, 1.6953, -1.6719, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:1') tensor([[-2.0000, -2.9531, -2.2656, 1.9844, -0.1387]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-2.0938, 0.8047, 3.0625, 0.5469, -1.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:24:59,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.69 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.7656, -3.5156, -2.6406, 1.3828, -0.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -4.5938, -1.2109, 2.9531, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-8.0000, -6.0000, -0.7578, 0.5195, -5.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0312, -4.4688, -1.4297, 1.1719, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5625, -4.0938, -0.7109, 2.6562, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:24:59,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.21 | optimizer_step: 0.19 [2025-11-06 18:24:59,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 95.57 | bwd_microstep: 174.76 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 173.67 | step_microstep: 2.05 [2025-11-06 18:24:59,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.28 | bwd: 175.61 | bwd_inner: 1.72 | bwd_allreduce: 173.73 | step: 2.14 47%|████▋ | 1644/3507 [40:13<43:37, 1.41s/it] {'loss': 0.4034, 'learning_rate': 1.1495424471964187e-05, 'epoch': 0.47} 47%|████▋ | 1644/3507 [40:13<43:37, 1.41s/it]tensor([[-2.6406, 0.8398, 2.3125, -2.3281, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4062, 
[DeepSpeed training log, steps 1645–1665 of 3507 (epoch 0.47). The interleaved per-rank logit/label debug prints (`tensor([[...]], device='cuda:N', dtype=torch.bfloat16, grad_fn=...)`, with the `grad_fn` class names stripped during extraction) and the per-step `log_dist` timing breakdowns (fwd/bwd/allreduce/optimizer microsteps) are condensed here; only the per-step progress records are kept, deduplicated, one per line.]

 47%|████▋ | 1645/3507 [40:14<40:16, 1.30s/it] {'loss': 0.3465, 'learning_rate': 1.1486290443884666e-05, 'epoch': 0.47}
 47%|████▋ | 1646/3507 [40:15<32:37, 1.05s/it] {'loss': 0.9082, 'learning_rate': 1.147715514759818e-05, 'epoch': 0.47}
 47%|████▋ | 1647/3507 [40:18<49:46, 1.61s/it] {'loss': 0.233, 'learning_rate': 1.1468018590899593e-05, 'epoch': 0.47}
 47%|████▋ | 1648/3507 [40:18<43:30, 1.40s/it] {'loss': 0.232, 'learning_rate': 1.1458880781584858e-05, 'epoch': 0.47}
 47%|████▋ | 1649/3507 [40:22<1:03:06, 2.04s/it] {'loss': 1.0662, 'learning_rate': 1.1449741727450994e-05, 'epoch': 0.47}
 47%|████▋ | 1650/3507 [40:22<48:35, 1.57s/it] {'loss': 0.4462, 'learning_rate': 1.144060143629608e-05, 'epoch': 0.47}
 47%|████▋ | 1651/3507 [40:25<54:20, 1.76s/it] {'loss': 0.3738, 'learning_rate': 1.143145991591925e-05, 'epoch': 0.47}
 47%|████▋ | 1652/3507 [40:27<57:53, 1.87s/it] {'loss': 0.2734, 'learning_rate': 1.1422317174120691e-05, 'epoch': 0.47}
 47%|████▋ | 1653/3507 [40:27<46:14, 1.50s/it] {'loss': 0.1805, 'learning_rate': 1.1413173218701629e-05, 'epoch': 0.47}
 47%|████▋ | 1654/3507 [40:30<53:28, 1.73s/it] {'loss': 0.4917, 'learning_rate': 1.1404028057464329e-05, 'epoch': 0.47}
 47%|████▋ | 1655/3507 [40:30<40:49, 1.32s/it] {'loss': 0.6786, 'learning_rate': 1.1394881698212079e-05, 'epoch': 0.47}
 47%|████▋ | 1656/3507 [40:33<52:15, 1.69s/it] {'loss': 0.7895, 'learning_rate': 1.1385734148749192e-05, 'epoch': 0.47}
 47%|████▋ | 1657/3507 [40:33<40:19, 1.31s/it] {'loss': 0.1466, 'learning_rate': 1.1376585416881002e-05, 'epoch': 0.47}
 47%|████▋ | 1658/3507 [40:35<50:28, 1.64s/it] {'loss': 0.638, 'learning_rate': 1.1367435510413841e-05, 'epoch': 0.47}
 47%|████▋ | 1659/3507 [40:36<39:10, 1.27s/it] {'loss': 0.9077, 'learning_rate': 1.135828443715505e-05, 'epoch': 0.47}
 47%|████▋ | 1660/3507 [40:38<48:14, 1.57s/it] {'loss': 0.1088, 'learning_rate': 1.1349132204912971e-05, 'epoch': 0.47}
 47%|████▋ | 1661/3507 [40:43<1:15:55, 2.47s/it] {'loss': 0.9402, 'learning_rate': 1.133997882149692e-05, 'epoch': 0.47}
 47%|████▋ | 1662/3507 [40:43<58:18, 1.90s/it] {'loss': 0.1361, 'learning_rate': 1.1330824294717214e-05, 'epoch': 0.47}
 47%|████▋ | 1663/3507 [40:45<56:46, 1.85s/it] {'loss': 0.5959, 'learning_rate': 1.1321668632385123e-05, 'epoch': 0.47}
 47%|████▋ | 1664/3507 [40:45<44:28, 1.45s/it] {'loss': 0.338, 'learning_rate': 1.131251184231291e-05, 'epoch': 0.47}
 47%|████▋ | 1665/3507 [40:47<46:10, 1.50s/it] {'loss': 0.5897, 'learning_rate': 1.1303353932313784e-05, 'epoch': 0.47}
device='cuda:0') tensor([[-3.7969, -3.3125, -0.4766, 2.1406, -1.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8750, -3.8438, 0.5391, 0.8477, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.1250, 2.1250, 2.4844, -2.1719, -1.8984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:25:34,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:25:34,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.93 | bwd_microstep: 137.94 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 136.79 | step_microstep: 1.70 [2025-11-06 18:25:34,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.54 | bwd: 138.91 | bwd_inner: 1.96 | bwd_allreduce: 136.82 | step: 1.77 48%|████▊ | 1666/3507 [40:48<36:50, 1.20s/it] {'loss': 0.585, 'learning_rate': 1.1294194910201913e-05, 'epoch': 0.48} 48%|████▊ | 1666/3507 [40:48<36:50, 1.20s/it]tensor([[-4.2812, -2.1406, 1.7500, 1.6172, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:25:34,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.69 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.2500, 0.3184, 2.4219, -2.2500, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.7812, -3.1562, 1.7188, 1.0781, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.1250, -3.4531, 0.3770, -1.0469, -4.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.8750, -3.7031, 0.0566, 2.0625, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, 
-2.8594, 0.7188, 0.9102, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6562, -1.7422, 1.7344, 1.8125, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4688, -4.0625, 0.4199, 2.1250, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:25:36,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.30 [2025-11-06 18:25:36,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.53 | bwd_microstep: 1514.57 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1513.44 | step_microstep: 2.18 [2025-11-06 18:25:36,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.24 | bwd: 1515.50 | bwd_inner: 1.87 | bwd_allreduce: 1513.49 | step: 2.27 48%|████▊ | 1667/3507 [40:49<43:04, 1.40s/it] {'loss': 0.6548, 'learning_rate': 1.1285034783792416e-05, 'epoch': 0.48} 48%|████▊ | 1667/3507 [40:49<43:04, 1.40s/it]tensor([[-6.1562, -3.5312, 1.6250, 1.0938, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0000, -4.7500, -1.9766, 1.2734, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0625, -6.0312, -2.7344, 1.3984, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:25:36,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.19 | bwd_microstep: 1.23 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.4062, -2.0156, 1.1797, 2.2344, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1094, 1.7344, 3.1094, -2.4375, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1562, -3.6406, -0.5195, 2.4531, -2.0938]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.1875, -5.0625, -0.5781, 1.9688, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4375, -1.5234, 1.2109, -1.5859, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:25:38,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.23 | optimizer_step: 0.25 [2025-11-06 18:25:38,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.42 | bwd_microstep: 2062.04 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 2060.98 | step_microstep: 13.17 [2025-11-06 18:25:38,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.62 | bwd: 2063.27 | bwd_inner: 2.08 | bwd_allreduce: 2061.04 | step: 13.26 48%|████▊ | 1668/3507 [40:52<54:13, 1.77s/it] {'loss': 0.1723, 'learning_rate': 1.1275873560901358e-05, 'epoch': 0.48} 48%|████▊ | 1668/3507 [40:52<54:13, 1.77s/it]tensor([[-4.5625, -3.7812, 0.1865, 3.4219, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2812, -1.3359, 2.4219, -2.1719, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:25:39,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.28 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-4.9375, -4.1875, -0.7227, 1.7734, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -2.9062, -0.8555, -0.1279, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0625, -1.6641, 2.4688, -0.5430, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9219, -2.3906, 1.2031, 2.2188, -2.4062]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -3.0469, 0.5391, 1.3438, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.6094, 1.8125, 2.8125, -1.8672, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:25:39,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.40 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:25:39,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 108.71 | bwd_microstep: 220.60 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 219.73 | step_microstep: 2.97 [2025-11-06 18:25:39,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.01 | bwd: 221.60 | bwd_inner: 1.63 | bwd_allreduce: 219.79 | step: 3.06 48%|████▊ | 1669/3507 [40:53<43:23, 1.42s/it] {'loss': 0.6993, 'learning_rate': 1.126671124934573e-05, 'epoch': 0.48} 48%|████▊ | 1669/3507 [40:53<43:23, 1.42s/it]tensor([[-1.8438, 1.3594, 2.6250, -1.6406, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:25:39,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.99 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2188, -3.9844, -0.8633, 2.4844, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2812, -3.5625, -1.6016, 1.9453, -1.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -3.6094, -0.6016, 2.1719, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1562, -4.2188, -1.0156, 0.6719, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.3594, 0.9219, 2.5312, -1.5859, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:2') tensor([[-3.5938, 0.4297, 2.9688, -2.5000, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7812, -3.8594, -0.1328, 2.2188, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:25:41,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.48 | optimizer_step: 0.18 [2025-11-06 18:25:41,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.58 | bwd_microstep: 2138.09 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 2137.24 | step_microstep: 2.17 [2025-11-06 18:25:41,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 271.58 | bwd: 2138.81 | bwd_inner: 1.38 | bwd_allreduce: 2137.28 | step: 2.25 48%|████▊ | 1670/3507 [40:55<53:02, 1.73s/it] {'loss': 0.343, 'learning_rate': 1.1257547856943458e-05, 'epoch': 0.48} 48%|████▊ | 1670/3507 [40:55<53:02, 1.73s/it]tensor([[-3.2500, -3.3594, -1.0781, 2.8906, -1.1016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.3125, -3.6875, 1.4375, 0.8633, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -3.3125, 0.4141, 1.6875, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -1.2266, 1.5547, -0.8750, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:25:42,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.81 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.3750, -1.7031, 1.2188, -1.1562, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.9531, -2.5156, 0.3887, 3.2812, -1.1328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-3.1562, 1.0156, 4.3438, -1.1562, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9844, -3.2188, 0.3086, 3.0781, -1.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:25:42,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:25:42,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.00 | bwd_microstep: 29.86 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 28.86 | step_microstep: 1.93 [2025-11-06 18:25:42,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.84 | bwd: 30.62 | bwd_inner: 1.59 | bwd_allreduce: 28.90 | step: 2.01 48%|████▊ | 1671/3507 [40:56<40:53, 1.34s/it] {'loss': 0.3124, 'learning_rate': 1.1248383391513391e-05, 'epoch': 0.48} 48%|████▊ | 1671/3507 [40:56<40:53, 1.34s/it]tensor([[-3.7188, -0.9180, 2.1094, -0.1885, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1875, -5.0000, -1.7188, 1.8750, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.3594, 0.7969, 2.3594, -1.3828, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.3281, -3.0469, -0.5273, 2.2812, -1.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6562, -4.4375, 0.1089, 2.4688, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:25:42,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.86 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.1562, 1.2969, 1.8906, -0.9609, -1.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-6.0938, -4.2500, 0.9258, 2.3125, 
-3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -2.4375, 1.1328, 0.0713, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:25:44,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:25:44,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.14 | bwd_microstep: 1479.22 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1478.14 | step_microstep: 1.89 [2025-11-06 18:25:44,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.01 | bwd: 1479.96 | bwd_inner: 1.65 | bwd_allreduce: 1478.18 | step: 1.98 48%|████▊ | 1672/3507 [40:58<49:01, 1.60s/it] {'loss': 0.4779, 'learning_rate': 1.1239217860875294e-05, 'epoch': 0.48} 48%|████▊ | 1672/3507 [40:58<49:01, 1.60s/it]tensor([[-3.6094, 0.1094, 3.6094, -0.6094, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.7422, 0.8047, 1.5625, -0.9375, -1.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.9062, -1.1953, 1.5781, -0.4355, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-8.1875, -6.6250, -1.0391, 1.1250, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:25:44,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.88 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2188, -1.6016, 1.8828, 0.3809, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4375, -0.5977, 3.2500, -1.0391, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.7500, -4.5312, 0.8398, 1.3359, -4.6875]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5938, -3.3438, 1.4609, 1.3672, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:25:44,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:25:44,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.66 | bwd_microstep: 1.77 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.74 | step_microstep: 1.37 [2025-11-06 18:25:44,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 408.57 | bwd: 2.79 | bwd_inner: 1.88 | bwd_allreduce: 0.78 | step: 1.45 48%|████▊ | 1673/3507 [40:58<38:24, 1.26s/it] {'loss': 0.3736, 'learning_rate': 1.1230051272849833e-05, 'epoch': 0.48} 48%|████▊ | 1673/3507 [40:58<38:24, 1.26s/it]tensor([[-2.2812, 0.4746, 2.1094, -0.5430, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:25:45,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.48 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.9531, -4.0312, -1.2109, 2.9219, -1.6172]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7812, -4.7812, -0.6328, 1.8984, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -2.7188, 2.4688, 1.6797, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3438, -0.4570, 3.5000, -1.0312, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0000, -2.4062, 1.3047, -0.1582, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.4688, -2.0938, -1.9375, 1.0859, 0.0299]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], 
device='cuda:1') tensor([[-1.7109, 1.7891, 3.3750, -1.1328, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:25:46,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.17 | optimizer_step: 0.27 [2025-11-06 18:25:46,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 218.12 | bwd_microstep: 1586.87 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 1585.53 | step_microstep: 2.33 [2025-11-06 18:25:46,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.63 | bwd: 1587.75 | bwd_inner: 2.05 | bwd_allreduce: 1585.57 | step: 2.40 48%|████▊ | 1674/3507 [41:00<44:51, 1.47s/it] {'loss': 0.4343, 'learning_rate': 1.1220883635258586e-05, 'epoch': 0.48} 48%|████▊ | 1674/3507 [41:00<44:51, 1.47s/it]tensor([[-2.0469, 1.4844, 3.0156, -1.9609, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:25:47,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.03 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.4062, -2.2344, 1.4375, 3.4375, -1.6641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2188, -4.1875, -1.5859, 2.0000, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.0928, 3.1094, 3.8750, -0.8750, -1.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-2.4062, 1.2500, 3.2344, -1.7812, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [h264 @ 0x820ec00] mmco: unref short failure [h264 @ 0x820ec00] mmco: unref short failure tensor([[-6.7812, -4.5938, 0.7695, 1.3750, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6875, -0.1973, 3.7188, 0.0261, -3.4062]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2812, -5.3438, -0.6133, 2.3594, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:25:47,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.75 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:25:47,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 104.57 | bwd_microstep: 504.53 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 503.50 | step_microstep: 2.16 [2025-11-06 18:25:47,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 261.62 | bwd: 505.36 | bwd_inner: 1.69 | bwd_allreduce: 503.53 | step: 2.23 48%|████▊ | 1675/3507 [41:01<38:42, 1.27s/it] {'loss': 0.2803, 'learning_rate': 1.1211714955924018e-05, 'epoch': 0.48} 48%|████▊ | 1675/3507 [41:01<38:42, 1.27s/it]tensor([[-6.4062, -4.3125, 0.3828, 0.9922, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5000, -4.9688, 0.1787, 2.1562, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3281, 0.0220, 2.6406, -1.3750, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:25:47,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.46 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0625, -1.4531, 2.3594, 1.0078, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8281, 0.0356, 2.2500, -2.8906, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.0625, -3.5000, -1.8594, 1.8828, -1.0703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1562, -3.0625, 0.2930, 2.2031, -2.3438]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2188, -1.7344, 2.0312, 0.9102, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:25:48,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:25:48,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 114.16 | bwd_microstep: 67.01 | bwd_inner_microstep: 1.42 | bwd_allreduce_microstep: 65.50 | step_microstep: 1.90 [2025-11-06 18:25:48,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 304.65 | bwd: 67.93 | bwd_inner: 2.26 | bwd_allreduce: 65.54 | step: 1.99 48%|████▊ | 1676/3507 [41:01<30:48, 1.01s/it] {'loss': 0.4734, 'learning_rate': 1.1202545242669498e-05, 'epoch': 0.48} 48%|████▊ | 1676/3507 [41:01<30:48, 1.01s/it]tensor([[-1.1641, 1.7266, 2.5156, -1.0703, -1.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.7500, -5.4375, -1.2656, 0.5859, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:25:48,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.88 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5625, -0.8750, 1.8828, -0.3105, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-7.2500, -5.4688, -1.2891, -0.8203, -5.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5938, -0.2852, 3.5781, -2.0938, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5000, -3.9375, 0.1289, 1.5781, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, -1.0078, 2.6719, 0.1040, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') 
tensor([[-4.4062, -1.1562, 2.7500, -0.2021, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:25:49,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:25:49,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.24 | bwd_microstep: 1301.11 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1300.00 | step_microstep: 2.46 [2025-11-06 18:25:49,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.14 | bwd: 1302.07 | bwd_inner: 1.87 | bwd_allreduce: 1300.05 | step: 2.55 48%|████▊ | 1677/3507 [41:03<36:35, 1.20s/it] {'loss': 0.8319, 'learning_rate': 1.1193374503319255e-05, 'epoch': 0.48} 48%|████▊ | 1677/3507 [41:03<36:35, 1.20s/it]tensor([[-4.2500, -4.1875, -0.8711, 3.2500, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3125, -4.2188, -0.5391, 1.3594, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:25:49,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.29 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-3.7031, -3.2188, -0.7148, 1.7734, -1.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9844, -4.0312, -1.6172, 2.0625, -1.7422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6875, -2.6250, 0.3574, 1.8047, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.0938, 1.2969, 3.5312, -0.3984, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1875, -4.0000, 0.3496, 0.1758, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2500, -2.4844, 1.7109, 
0.1133, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:25:50,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.81 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:25:50,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.32 | bwd_microstep: 807.61 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 806.51 | step_microstep: 2.68 [2025-11-06 18:25:50,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 345.65 | bwd: 808.69 | bwd_inner: 1.95 | bwd_allreduce: 806.57 | step: 2.79 48%|████▊ | 1678/3507 [41:04<36:33, 1.20s/it] {'loss': 0.3902, 'learning_rate': 1.1184202745698414e-05, 'epoch': 0.48} 48%|████▊ | 1678/3507 [41:04<36:33, 1.20s/it]tensor([[-5.5312, -4.5312, -0.4785, 1.9922, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5938, -5.4375, -1.5781, 2.7812, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([[-3.4062, -0.5430, 2.4375, -0.0903, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)tensor([3], device='cuda:3') tensor([2], device='cuda:0') [2025-11-06 18:25:51,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.45 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.7344, -1.1953, 1.7578, 2.5312, -1.5078]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0625, -2.2500, 1.0859, -1.2266, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6406, -0.7383, 2.7656, 0.4648, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7031, -3.4375, -2.3281, 1.8594, -0.5859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.1875, 1.8594, 3.0000, -3.1719, -3.1406]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:25:53,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.18 | optimizer_step: 0.16 [2025-11-06 18:25:53,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.74 | bwd_microstep: 1687.78 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 1686.93 | step_microstep: 1.76 [2025-11-06 18:25:53,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.22 | bwd: 1688.56 | bwd_inner: 1.44 | bwd_allreduce: 1686.97 | step: 1.84 48%|████▊ | 1679/3507 [41:06<44:37, 1.46s/it] {'loss': 0.1623, 'learning_rate': 1.1175029977632954e-05, 'epoch': 0.48} 48%|████▊ | 1679/3507 [41:06<44:37, 1.46s/it]tensor([[-2.1094, 1.2578, 2.0625, -2.7344, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:25:53,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.47 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.8750, -2.0625, 0.8281, -1.5234, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8906, -1.1719, 1.9531, 0.1436, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1562, -4.8438, -1.5391, 2.0000, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -3.5312, -0.2578, 2.8906, -1.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1250, -3.3438, 0.6094, 1.4219, -3.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2500, -3.2656, 2.3281, 1.1562, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.8281, 1.5547, 3.5469, -0.1299, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:3') [2025-11-06 18:25:54,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.20 | optimizer_step: 0.21 [2025-11-06 18:25:54,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.72 | bwd_microstep: 718.52 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 717.40 | step_microstep: 2.28 [2025-11-06 18:25:54,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.22 | bwd: 719.36 | bwd_inner: 1.77 | bwd_allreduce: 717.46 | step: 2.35 48%|████▊ | 1680/3507 [41:07<41:26, 1.36s/it] {'loss': 0.4073, 'learning_rate': 1.1165856206949726e-05, 'epoch': 0.48} 48%|████▊ | 1680/3507 [41:07<41:26, 1.36s/it]tensor([[-4.1875, -1.3359, 2.0312, -0.2197, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2500, -5.3438, -0.9023, 2.0312, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -1.5000, 3.2969, 0.6016, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2031, 0.3887, 2.5781, -1.8906, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-2.7031, 0.2656, 2.3906, -0.4961, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6562, -1.9844, 1.4609, -2.3906, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:25:54,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.57 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.1875, -3.8594, 0.1152, 1.7734, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8438, -2.1562, 1.2031, -0.7578, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') 
[2025-11-06 18:25:55,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.15 | optimizer_step: 0.19 [2025-11-06 18:25:55,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.67 | bwd_microstep: 189.41 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 188.40 | step_microstep: 1.63 [2025-11-06 18:25:55,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.26 | bwd: 190.43 | bwd_inner: 1.85 | bwd_allreduce: 188.44 | step: 1.72 48%|████▊ | 1681/3507 [41:08<37:33, 1.23s/it] {'loss': 0.3945, 'learning_rate': 1.1156681441476429e-05, 'epoch': 0.48} 48%|████▊ | 1681/3507 [41:08<37:33, 1.23s/it]tensor([[-3.5156, -1.1094, 1.6094, 0.0084, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.1562, -5.2188, -0.8008, 2.3125, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5312, 0.9688, 2.3594, -2.3438, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:25:55,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.96 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0312, -2.2969, 1.6406, 2.4375, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.2188, -4.2812, 0.0559, -1.8438, -5.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2656, -3.5938, -1.7500, 2.1250, -1.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0000, -3.6562, 0.5547, 2.3438, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0625, -3.0469, 0.4316, 2.5469, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:25:56,862] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.18 | optimizer_step: 0.26 [2025-11-06 18:25:56,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.64 | bwd_microstep: 1435.37 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 1434.24 | step_microstep: 2.04 [2025-11-06 18:25:56,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.62 | bwd: 1436.26 | bwd_inner: 1.85 | bwd_allreduce: 1434.29 | step: 2.13 48%|████▊ | 1682/3507 [41:10<42:25, 1.39s/it] {'loss': 0.6758, 'learning_rate': 1.1147505689041624e-05, 'epoch': 0.48} 48%|████▊ | 1682/3507 [41:10<42:25, 1.39s/it]tensor([[5.0000, 7.1250, 6.6562, 4.3125, 3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -2.5312, 1.5391, 2.6562, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0312, -3.6094, -0.6875, 2.2812, -1.9922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0312, -2.9844, 0.2090, 1.9766, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.0469, -2.5469, -2.0156, 1.0781, -0.4297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:25:58,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.61 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.8125, -4.0938, -0.3887, 2.2812, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1875, -0.4727, 2.3125, -1.9141, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4688, -3.5781, 0.4883, 1.0234, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:25:59,333] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:25:59,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.40 | bwd_microstep: 926.43 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 925.29 | step_microstep: 1.65 [2025-11-06 18:25:59,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.03 | bwd: 927.38 | bwd_inner: 1.90 | bwd_allreduce: 925.33 | step: 1.73 48%|████▊ | 1683/3507 [41:13<52:12, 1.72s/it] {'loss': 0.7108, 'learning_rate': 1.1138328957474691e-05, 'epoch': 0.48} 48%|████▊ | 1683/3507 [41:13<52:12, 1.72s/it]tensor([[-4.7188, -4.8750, -2.1875, 1.8047, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4062, -2.8906, 0.6328, 1.7266, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6875, -2.7500, 1.1016, 1.2500, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:25:59,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 213.91 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.2500, -2.8594, 1.0312, 2.4062, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1250, -1.7578, 2.0625, -0.8711, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8750, -4.4062, -1.1641, 1.8828, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.4062, -3.0312, 1.8359, -0.7812, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8438, -2.4844, 0.5000, -0.8867, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:26:01,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 
| optimizer_gradients: 0.18 | optimizer_step: 0.16 [2025-11-06 18:26:01,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.01 | bwd_microstep: 1996.25 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1995.17 | step_microstep: 1.92 [2025-11-06 18:26:01,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 388.95 | bwd: 1997.29 | bwd_inner: 1.95 | bwd_allreduce: 1995.20 | step: 1.99 48%|████▊ | 1684/3507 [41:15<58:37, 1.93s/it] {'loss': 0.2137, 'learning_rate': 1.1129151254605872e-05, 'epoch': 0.48} 48%|████▊ | 1684/3507 [41:15<58:37, 1.93s/it]tensor([[-4.6875, -2.3906, 0.8242, -0.3359, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9531, -3.7344, -0.9688, 2.3281, -1.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -3.5469, 0.7305, 3.1406, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:01,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.11 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.8125, -3.2969, 1.2734, 0.5273, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4062, -4.9062, -3.3438, 0.8398, -1.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-5.5625, -3.0781, 0.0281, -1.3828, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2500, -3.5156, 0.4941, 1.3594, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5938, -3.2031, 0.6211, 2.0469, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:26:03,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.16 | 
optimizer_step: 0.16 [2025-11-06 18:26:03,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.49 | bwd_microstep: 1003.79 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1002.68 | step_microstep: 2.13 [2025-11-06 18:26:03,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 309.62 | bwd: 1004.62 | bwd_inner: 1.77 | bwd_allreduce: 1002.72 | step: 2.21 48%|████▊ | 1685/3507 [41:16<53:16, 1.75s/it] {'loss': 0.6703, 'learning_rate': 1.1119972588266217e-05, 'epoch': 0.48} 48%|████▊ | 1685/3507 [41:16<53:16, 1.75s/it]tensor([[-3.9375, -3.8438, -0.6953, 3.2969, -1.5859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6250, -0.4648, 2.4062, -0.3477, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.8438, -6.3125, -2.0938, 1.6562, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:03,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.54 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.3438, -3.9219, -0.6719, 2.4844, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.4688, 1.5234, 3.9062, -1.5312, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6562, -3.4688, 0.7461, 0.3770, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -2.3281, 0.9180, 0.9453, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.7812, -6.5000, -2.6250, 1.2422, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:03,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 
18:26:03,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.46 | bwd_microstep: 1.87 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.81 | step_microstep: 1.89 [2025-11-06 18:26:03,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.02 | bwd: 2.61 | bwd_inner: 1.65 | bwd_allreduce: 0.84 | step: 1.96 48%|████▊ | 1686/3507 [41:17<41:07, 1.35s/it] {'loss': 0.558, 'learning_rate': 1.1110792966287609e-05, 'epoch': 0.48} 48%|████▊ | 1686/3507 [41:17<41:07, 1.35s/it]tensor([[-3.9531, -4.5000, -2.1875, 2.3594, -1.5391]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2656, 1.4141, 2.9375, -2.2656, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4688, -4.2500, -0.6289, 1.1797, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:03,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.86 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.3906, -0.8711, 2.0156, 0.2754, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.8438, -5.8125, -1.6406, 0.9023, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1562, -4.8750, 0.1245, 2.6406, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6719, -3.0625, -0.1133, 2.2969, -1.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7344, -2.1875, 0.7852, 3.6719, -0.9258]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:26:05,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:26:05,345] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 259.76 | bwd_microstep: 1313.33 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1312.22 | step_microstep: 2.18 [2025-11-06 18:26:05,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 461.66 | bwd: 1314.24 | bwd_inner: 1.83 | bwd_allreduce: 1312.26 | step: 2.26 48%|████▊ | 1687/3507 [41:19<45:19, 1.49s/it] {'loss': 0.1204, 'learning_rate': 1.1101612396502743e-05, 'epoch': 0.48} 48%|████▊ | 1687/3507 [41:19<45:19, 1.49s/it]tensor([[-5.4062, -5.7500, -3.4375, 0.6562, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4688, -3.2031, 0.4414, -0.3770, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:05,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.68 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.5625, -1.9531, 0.7148, 1.2500, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4688, -4.2812, -1.8672, 1.1406, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -4.0938, 0.1768, 2.3906, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6719, 0.4727, 3.2812, 0.1992, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3438, 0.7188, 3.6406, -1.5156, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.9375, -4.2812, 0.9258, 0.2812, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:26:05,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.21 | optimizer_step: 0.29 [2025-11-06 18:26:05,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
221.38 | bwd_microstep: 39.82 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 38.82 | step_microstep: 2.15 [2025-11-06 18:26:05,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.10 | bwd: 40.50 | bwd_inner: 1.49 | bwd_allreduce: 38.86 | step: 2.23 48%|████▊ | 1688/3507 [41:19<35:49, 1.18s/it] {'loss': 0.3997, 'learning_rate': 1.1092430886745124e-05, 'epoch': 0.48} 48%|████▊ | 1688/3507 [41:19<35:49, 1.18s/it]tensor([[-3.0938, 0.2324, 2.0625, -2.2188, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-7.6250, -5.6875, -2.1875, -2.3281, -5.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -3.3125, 0.6758, 1.2500, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:06,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.99 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.8125, -2.7188, 2.0469, 0.1904, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7188, -1.3125, 2.5469, -0.6289, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8438, -4.5312, 0.2285, 2.5938, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.9688, -2.2031, 1.8672, -0.0300, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5938, -2.9219, 1.8281, -1.3594, -5.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:26:08,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 18:26:08,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.15 | bwd_microstep: 2356.19 | 
bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 2354.98 | step_microstep: 1.90 [2025-11-06 18:26:08,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 530.17 | bwd: 2357.15 | bwd_inner: 2.01 | bwd_allreduce: 2355.02 | step: 1.97 48%|████▊ | 1689/3507 [41:22<51:43, 1.71s/it] {'loss': 0.4719, 'learning_rate': 1.1083248444849058e-05, 'epoch': 0.48} 48%|████▊ | 1689/3507 [41:22<51:43, 1.71s/it]tensor([[-5.0938, -3.6406, 0.0452, 1.3750, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:08,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.18 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.08 tensor([[-3.2500, -2.4062, 0.1030, 2.0938, -1.6172]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-1.3750, -1.6406, -0.1543, 3.4062, 0.3008]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3438, -4.7188, -1.0000, 1.9844, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2656, -3.7500, -1.3672, 3.1406, -0.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[ 0.4980, 3.4219, 4.7500, 0.8086, -0.2559]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4062, -0.9688, 2.7969, -0.3652, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0312, -3.8281, 0.1055, 2.2500, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:09,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:26:09,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.94 | bwd_microstep: 2.13 | bwd_inner_microstep: 1.43 | 
bwd_allreduce_microstep: 0.63 | step_microstep: 1.99 [2025-11-06 18:26:09,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 404.15 | bwd: 3.12 | bwd_inner: 2.34 | bwd_allreduce: 0.66 | step: 2.07 48%|████▊ | 1690/3507 [41:22<40:13, 1.33s/it] {'loss': 0.5932, 'learning_rate': 1.1074065078649647e-05, 'epoch': 0.48} 48%|████▊ | 1690/3507 [41:23<40:13, 1.33s/it]tensor([[-4.3750, -1.3594, 1.9922, -0.4414, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6562, -4.2812, -0.2441, 1.4688, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2031, -3.0156, -2.0781, 2.0938, -0.1943]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:09,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.25 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.8125, -3.4062, 0.5625, 2.2656, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3281, -1.8281, 1.1016, 1.6484, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5156, -1.8906, 1.5156, 2.3594, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6875, -0.4277, 3.4688, -1.8516, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.9062, -4.4688, 0.9648, 1.0859, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:26:11,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:26:11,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.36 | bwd_microstep: 1567.28 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1566.12 | 
step_microstep: 2.38 [2025-11-06 18:26:11,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.64 | bwd: 1568.12 | bwd_inner: 1.83 | bwd_allreduce: 1566.16 | step: 2.45 48%|████▊ | 1691/3507 [41:24<45:52, 1.52s/it] {'loss': 0.455, 'learning_rate': 1.106488079598278e-05, 'epoch': 0.48} 48%|████▊ | 1691/3507 [41:24<45:52, 1.52s/it]tensor([[-4.7812, -2.4062, 1.9141, 1.2656, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6562, -0.5664, 0.2637, -3.5156, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:3') tensor([[-4.1875, -2.7031, 0.2598, 0.9258, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:11,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.97 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.6875, -2.2500, 2.4531, -0.2949, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1875, 0.2012, 3.6875, -2.1250, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5000, -3.7969, -1.3516, 2.8594, -1.2109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2812, -2.2812, 1.8828, -0.0591, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2500, -0.3398, 2.5938, -2.2031, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:26:11,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:26:11,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.39 | bwd_microstep: 2.11 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 0.77 | step_microstep: 1.68 [2025-11-06 
18:26:11,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 415.39 | bwd: 3.08 | bwd_inner: 2.12 | bwd_allreduce: 0.82 | step: 1.77 48%|████▊ | 1692/3507 [41:25<36:15, 1.20s/it] {'loss': 1.0171, 'learning_rate': 1.1055695604685133e-05, 'epoch': 0.48} 48%|████▊ | 1692/3507 [41:25<36:15, 1.20s/it]tensor([[-0.8633, 2.2031, 3.7031, 0.5391, -1.1328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[1.4922, 1.7188, 3.6875, 6.7188, 2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:11,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 209.47 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.4375, -3.6094, -0.4473, 1.6562, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -1.6875, 3.1562, 1.2500, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0312, -2.9844, -1.2344, 1.6250, -1.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8438, -1.6875, 2.6562, 0.3828, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6562, -2.7188, 0.5039, -2.2188, -4.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8125, -2.8125, -0.4180, 3.0312, -0.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:26:13,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:26:13,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.14 | bwd_microstep: 1020.16 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1018.93 | step_microstep: 2.32 [2025-11-06 18:26:13,014] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.62 | bwd: 1021.08 | bwd_inner: 1.95 | bwd_allreduce: 1018.98 | step: 2.40 48%|████▊ | 1693/3507 [41:26<38:18, 1.27s/it] {'loss': 0.5505, 'learning_rate': 1.1046509512594148e-05, 'epoch': 0.48} 48%|████▊ | 1693/3507 [41:26<38:18, 1.27s/it]tensor([[-4.6562, -5.2188, -2.8750, 1.6562, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2500, -3.5938, -0.3027, 2.4531, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6875, -3.2969, 0.1982, 1.5469, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:13,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.58 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.7344, -1.5312, 1.5312, 3.0156, -1.2891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -4.2500, -0.3066, 2.3906, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.5000, 1.7188, 4.6875, -0.9922, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6719, -3.9844, -1.5469, 2.6875, -1.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.4062, -0.4355, 1.3047, -0.0557, -1.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:26:14,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:26:14,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.88 | bwd_microstep: 661.38 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 660.46 | step_microstep: 2.14 [2025-11-06 18:26:14,567] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 330.47 | bwd: 662.26 | bwd_inner: 1.61 | bwd_allreduce: 660.51 | step: 2.22 48%|████▊ | 1694/3507 [41:28<40:52, 1.35s/it] {'loss': 0.4802, 'learning_rate': 1.1037322527548046e-05, 'epoch': 0.48} 48%|████▊ | 1694/3507 [41:28<40:52, 1.35s/it]tensor([[-1.9688, 1.5625, 3.1094, -1.3203, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2812, -3.0312, 0.4980, 2.0469, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:14,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.27 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.2031, 1.1406, 3.2812, -0.7070, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0312, -5.2188, -2.2969, 1.8984, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.5000, 0.2969, 3.2344, -1.7656, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9688, -0.5156, 2.6719, 1.3516, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9062, -2.5938, 1.4609, 1.0391, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0312, -4.1875, 0.6953, 1.9141, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:26:17,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.20 | optimizer_step: 0.34 [2025-11-06 18:26:17,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.33 | bwd_microstep: 2133.86 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 2132.68 | step_microstep: 2.43 [2025-11-06 18:26:17,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.63 | bwd: 2134.70 | 
bwd_inner: 1.83 | bwd_allreduce: 2132.73 | step: 2.51 48%|████▊ | 1695/3507 [41:30<51:36, 1.71s/it] {'loss': 0.3654, 'learning_rate': 1.1028134657385804e-05, 'epoch': 0.48} 48%|████▊ | 1695/3507 [41:30<51:36, 1.71s/it]tensor([[-5.2500, -1.3359, 0.7109, -4.0938, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:26:17,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.00 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.4688, -4.2188, -1.3047, 1.8594, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4688, -1.5234, 2.4531, 0.6367, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.6250, -4.5625, 0.5703, 1.1328, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5000, -1.7031, 3.0000, -0.5273, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0156, -0.2793, 2.1875, -0.1992, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3125, -4.5312, -0.7852, 1.9375, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6250, 0.1221, 2.5469, -2.0938, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:26:17,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:26:17,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.68 | bwd_microstep: 1.86 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.82 | step_microstep: 1.82 [2025-11-06 18:26:17,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 280.69 | bwd: 2.80 | bwd_inner: 1.82 | bwd_allreduce: 0.86 | 
step: 1.90

48%|████▊ | 1696/3507 [41:31<43:02, 1.43s/it] {'loss': 0.5186, 'learning_rate': 1.1018945909947157e-05, 'epoch': 0.48}
tensor([[-3.3906, -1.9062, 1.4375, 2.4688, -1.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.1875, -2.3125, 1.3125, 1.3672, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.0000, -4.8438, -0.7695, 1.3672, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.4219, 0.0250, 2.7500, 1.6250, -1.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:18,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.59 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.6250, -0.0067, 2.9062, -1.2891, -3.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.0625, -1.7656, 0.6445, 1.6641, -1.7578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.3125, -2.5625, 1.0078, -0.7422, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.0938, -3.1719, 1.3281, -0.1201, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:26:18,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:26:18,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.25 | bwd_microstep: 64.00 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 63.18 | step_microstep: 1.64
[2025-11-06 18:26:18,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.87 | bwd: 64.80 | bwd_inner: 1.44 | bwd_allreduce: 63.22 | step: 1.72

48%|████▊ | 1697/3507 [41:32<34:14, 1.14s/it] {'loss': 0.8237, 'learning_rate': 1.1009756293072582e-05, 'epoch': 0.48}
tensor([[-3.2344, -0.1191, 1.5781, -1.8906, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.0000, -0.6914, 1.1875, -0.5508, -2.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:18,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.63 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.54 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.7969, -0.2393, 2.7031, -1.0938, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-8.4375, -6.3125, -1.6250, -1.2734, -6.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-1.6641, -0.3340, 1.0000, 1.0703, -0.9648]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.1562, -1.8359, 2.3281, -0.4316, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.4375, -1.7109, 2.1094, 0.4629, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.6250, -3.8594, -0.5273, 2.0781, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:21,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:26:21,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.45 | bwd_microstep: 1.87 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.30
[2025-11-06 18:26:21,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.10 | bwd: 2.52 | bwd_inner: 1.55 | bwd_allreduce: 0.84 | step: 2.38

48%|████▊ | 1698/3507 [41:35<50:01, 1.66s/it] {'loss': 0.3043, 'learning_rate': 1.100056581460331e-05, 'epoch': 0.48}
tensor([[-5.9375, -3.2344, 1.5312, 0.6406, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.6562, -4.3438, -0.2676, 1.4688, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.0312, -0.7578, 1.8281, -1.5156, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:21,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.15 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.2812, -2.3281, 2.0938, 0.4512, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.0000, -4.3438, 0.6367, 2.4062, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.0312, -3.0625, 0.3164, 2.6562, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.5000, -3.8281, -0.5586, 2.0000, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-7.1250, -4.3125, 0.9180, 0.2012, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 18:26:21,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:26:21,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.09 | bwd_microstep: 39.93 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 38.89 | step_microstep: 1.45
[2025-11-06 18:26:21,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.27 | bwd: 40.96 | bwd_inner: 1.89 | bwd_allreduce: 38.93 | step: 1.55

48%|████▊ | 1699/3507 [41:35<39:11, 1.30s/it] {'loss': 0.1973, 'learning_rate': 1.0991374482381293e-05, 'epoch': 0.48}
tensor([[-5.0000, -5.1562, -2.3750, 1.7812, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:2')
tensor([[-3.4219, -1.8984, 0.8477, 1.5000, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.1250, -3.0625, 0.7695, 0.9727, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.2500, -0.6680, 3.1875, -0.3809, -3.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.0000, -2.6094, 2.3125, 2.3594, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-0.3535, 2.5469, 2.3906, -1.9297, -1.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:22,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.57 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.1562, -2.9219, 2.0938, 0.0796, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.9375, -3.5000, 1.5156, 1.1797, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:22,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 18:26:22,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.42 | bwd_microstep: 1.90 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.18
[2025-11-06 18:26:22,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 448.01 | bwd: 2.80 | bwd_inner: 1.79 | bwd_allreduce: 0.88 | step: 2.26

48%|████▊ | 1700/3507 [41:36<36:17, 1.21s/it] {'loss': 0.979, 'learning_rate': 1.0982182304249222e-05, 'epoch': 0.48}
tensor([[-5.3125, -2.6875, 1.4531, 0.1436, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.7812, -3.1562, 1.5391, 0.8711, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.6250, -4.0938, -0.4902, 0.2793, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.3438, -2.9688, 2.1094, -0.1328, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:22,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.99 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.3750, -4.5625, -1.6094, 2.9219, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.2812, -3.6250, 0.2422, 1.1875, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-0.4238, 2.8438, 3.1094, -1.3984, -1.3047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-5.1250, -3.7188, 0.4688, 2.2188, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:23,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:26:23,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.54 | bwd_microstep: 1.96 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.82 | step_microstep: 1.59
[2025-11-06 18:26:23,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.55 | bwd: 2.83 | bwd_inner: 1.85 | bwd_allreduce: 0.86 | step: 1.67

49%|████▊ | 1701/3507 [41:37<30:23, 1.01s/it] {'loss': 0.5409, 'learning_rate': 1.0972989288050511e-05, 'epoch': 0.49}
tensor([[-3.5781, -3.8281, -1.4609, 2.5781, -1.2734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.0938, -2.0781, 2.2500, 0.1074, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.7500, -3.8438, -0.8789, 1.0703, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:23,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.69 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.1875, 0.6016, 2.8750, -1.8281, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.1250, -0.8555, 2.6719, -0.2949, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.0938, -3.1406, 1.1406, 1.6406, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.9219, -0.2432, 2.6875, -1.5156, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.9062, -3.6406, -0.9727, 2.1719, -1.8672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:26:25,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.65 | optimizer_gradients: 0.20 | optimizer_step: 0.20
[2025-11-06 18:26:25,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.47 | bwd_microstep: 1153.57 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 1152.66 | step_microstep: 3.55
[2025-11-06 18:26:25,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.18 | bwd: 1154.27 | bwd_inner: 1.41 | bwd_allreduce: 1152.71 | step: 3.63

49%|████▊ | 1702/3507 [41:39<41:45, 1.39s/it] {'loss': 0.1416, 'learning_rate': 1.0963795441629275e-05, 'epoch': 0.49}
tensor([[-5.6250, -2.7031, 1.8906, 0.1016, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.0000, -0.0408, 2.2500, -0.3223, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.1875, -3.7188, 1.5000, 1.2188, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.1562, -5.3125, -1.6562, 0.8086, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.1719, -2.4219, 0.4609, 2.5938, -1.5234]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.1406, -2.9844, -0.0566, 3.6406, -1.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.3438, -3.5625, 0.7422, 1.8125, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:26,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.10 | bwd_microstep: 1.18 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.0938, -0.3828, 3.1719, -1.0156, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:27,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.35 | optimizer_step: 0.29
[2025-11-06 18:26:27,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.45 | bwd_microstep: 3.10 | bwd_inner_microstep: 1.49 | bwd_allreduce_microstep: 1.43 | step_microstep: 2.87
[2025-11-06 18:26:27,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 450.61 | bwd: 4.28 | bwd_inner: 2.57 | bwd_allreduce: 1.48 | step: 2.96

49%|████▊ | 1703/3507 [41:41<47:47, 1.59s/it] {'loss': 0.2247, 'learning_rate': 1.0954600772830352e-05, 'epoch': 0.49}
tensor([[-4.5625, -3.8281, -0.3223, 2.3750, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.6250, -5.0312, -1.2812, 1.9297, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.2812, -4.5625, -1.7344, 2.5469, -1.8203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:27,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 285.86 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.5312, -0.9961, 2.1094, -1.7422, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-5.0312, -1.6250, 2.3125, -1.4062, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.6562, -2.5156, 0.2832, -0.5977, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.1250, -1.3906, 2.2188, 0.5547, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.5938, -4.3750, -0.2178, 1.9531, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:26:28,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.29
[2025-11-06 18:26:28,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 110.84 | bwd_microstep: 737.65 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 736.48 | step_microstep: 2.11
[2025-11-06 18:26:28,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 396.70 | bwd: 738.40 | bwd_inner: 1.70 | bwd_allreduce: 736.53 | step: 2.19

49%|████▊ | 1704/3507 [41:42<44:03, 1.47s/it] {'loss': 0.9576, 'learning_rate': 1.094540528949928e-05, 'epoch': 0.49}
tensor([[-4.7188, -1.3047, 2.3906, -0.7695, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.8750, -4.1250, -0.6797, 2.1719, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.9688, -3.1875, 1.5000, 2.4844, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.0000, 1.1016, 3.0469, -2.8438, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.3750, -3.0156, 1.7422, 1.6484, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.1250, -5.0938, -1.3047, 0.7461, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.9219, -0.7266, 1.9531, -1.2891, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:29,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.15 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.6562, -3.6719, -0.3398, 1.7578, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:29,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:26:29,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.40 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.01
[2025-11-06 18:26:29,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.56 | bwd: 2.57 | bwd_inner: 1.62 | bwd_allreduce: 0.83 | step: 2.09

49%|████▊ | 1705/3507 [41:43<38:35, 1.29s/it] {'loss': 0.8673, 'learning_rate': 1.093620899948228e-05, 'epoch': 0.49}
tensor([[-0.3418, 2.9531, 2.7344, -1.8828, -1.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
[2025-11-06 18:26:29,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.14 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.6562, -2.8281, 0.9141, 1.1250, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.0000, -1.7344, 2.0312, -0.7266, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.3281, -0.1562, 2.3594, -0.8906, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.0625, -2.7812, 0.3301, 1.6250, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.0000, -3.6250, 1.4453, 1.5234, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.0000, -2.2500, 1.8516, 0.3926, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.6250, -4.2188, -2.3594, 1.9922, -1.3203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:30,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:26:30,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 104.02 | bwd_microstep: 1.83 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.88 | step_microstep: 2.21
[2025-11-06 18:26:30,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 240.17 | bwd: 2.77 | bwd_inner: 1.70 | bwd_allreduce: 0.92 | step: 2.30

49%|████▊ | 1706/3507 [41:44<38:03, 1.27s/it] {'loss': 0.3669, 'learning_rate': 1.092701191062628e-05, 'epoch': 0.49}
tensor([[-6.6875, -4.5000, 0.2969, 0.9883, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.0312, -0.3652, 3.1406, -1.0938, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.5156, -3.1719, -2.0156, 1.8750, -0.5430]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.8438, -2.6719, 1.6172, 1.7188, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.6875, -3.7812, -0.3164, 1.9219, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.5312, -4.7188, -1.5625, 2.9531, -1.9141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.6562, -3.4219, -2.2500, 1.9609, -0.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:31,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.02 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.5312, -4.5625, -0.5156, 2.0156, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:31,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:26:31,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.67 | bwd_microstep: 2.30 | bwd_inner_microstep: 1.29 | bwd_allreduce_microstep: 0.92 | step_microstep: 2.24
[2025-11-06 18:26:31,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.71 | bwd: 3.17 | bwd_inner: 2.06 | bwd_allreduce: 0.95 | step: 2.31

49%|████▊ | 1707/3507 [41:45<36:45, 1.23s/it] {'loss': 0.1944, 'learning_rate': 1.091781403077887e-05, 'epoch': 0.49}
tensor([[-3.0781, 0.4199, 2.7656, -1.3750, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:32,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 118.20 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-5.0312, -4.8750, -1.5156, 2.6094, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.3594, -2.7656, -1.5391, 2.0469, -0.5078]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:1')
tensor([[-4.8750, -3.1250, 0.5898, 1.1094, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.2188, -3.6250, 1.2344, 0.4551, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.4062, -1.6875, 2.8281, -0.9180, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.7812, -3.6406, 0.4746, 2.7500, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.7500, -3.9062, -0.4023, 1.5859, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:26:34,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.26 | optimizer_step: 0.22
[2025-11-06 18:26:34,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.57 | bwd_microstep: 1954.33 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 1953.27 | step_microstep: 3.15
[2025-11-06 18:26:34,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.79 | bwd: 1955.05 | bwd_inner: 1.56 | bwd_allreduce: 1953.32 | step: 3.24

49%|████▊ | 1708/3507 [41:48<49:11, 1.64s/it] {'loss': 0.7423, 'learning_rate': 1.0908615367788331e-05, 'epoch': 0.49}
tensor([[-4.9688, -3.1719, 0.8672, 1.8203, -3.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.9531, -1.4688, 1.4297, 2.0469, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.1250, -2.4688, 1.5391, -2.1719, -5.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.7500, -4.2188, -1.4844, 1.1094, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.1875, -4.1250, -0.4180, -0.3066, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.8594, -3.7969, -1.3750, 2.0312, -1.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.5000, -1.5703, 2.7031, -1.3984, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:36,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.19 | bwd_microstep: 1.28 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15
tensor([[-3.6719, -0.5156, 1.6562, -1.3203, -3.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:36,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:26:36,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.71 | bwd_microstep: 1.90 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.89
[2025-11-06 18:26:36,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.89 | bwd: 3.18 | bwd_inner: 2.11 | bwd_allreduce: 0.86 | step: 3.04

49%|████▊ | 1709/3507 [41:50<50:22, 1.68s/it] {'loss': 0.2325, 'learning_rate': 1.0899415929503602e-05, 'epoch': 0.49}
tensor([[-4.0000, -2.6875, 1.0391, 2.7344, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:36,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.48 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.8750, -3.6875, -2.1875, 2.5000, -0.6055]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.5156, 0.8750, 4.0312, -2.0625, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.2500, -0.3086, 3.4844, 1.4062, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.4375, -1.9922, 2.7188, -0.0825, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.7344, -2.6719, -0.4258, 0.8242, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.9062, -4.9688, -1.9297, 2.4062, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.5312, -2.1250, 1.2578, 2.4062, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:26:37,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:26:37,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.90 | bwd_microstep: 718.04 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 716.81 | step_microstep: 2.11
[2025-11-06 18:26:37,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.41 | bwd: 719.00 | bwd_inner: 2.02 | bwd_allreduce: 716.85 | step: 2.19

49%|████▉ | 1710/3507 [41:51<45:17, 1.51s/it] {'loss': 1.0509, 'learning_rate': 1.0890215723774289e-05, 'epoch': 0.49}
tensor([[-4.7500, -0.6094, 3.2344, -1.6406, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:37,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.50 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.1875, -1.0547, 2.1562, -0.5469, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-1.5234, 1.2031, 2.3125, -1.1016, -1.8828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
tensor([[-4.1250, -0.0466, 3.5938, -1.3516, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.7188, 0.2451, 2.5156, -0.3887, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.9062, -3.8594, 1.3359, -0.1963, -5.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.2188, -5.1562, -0.3672, 2.6250, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.6562, -2.1406, 1.5938, 0.4707, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:26:38,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:26:38,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 73.34 | bwd_microstep: 565.27 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 564.11 | step_microstep: 2.30
[2025-11-06 18:26:38,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 209.86 | bwd: 566.19 | bwd_inner: 1.91 | bwd_allreduce: 564.15 | step: 2.38

49%|████▉ | 1711/3507 [41:52<38:55, 1.30s/it] {'loss': 0.2911, 'learning_rate': 1.088101475845065e-05, 'epoch': 0.49}
tensor([[-4.5938, -3.3125, 0.4824, 2.1406, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.2500, -4.6562, -1.2656, 1.6875, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.4375, -3.0000, 1.5078, 1.2344, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:38,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.68 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.7500, -3.5156, -0.4297, -1.4375, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-1.6641, 1.6797, 2.6094, -2.0312, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.5938, -5.1562, -1.0625, 2.8438, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.3125, -3.6875, 1.0000, 2.5625, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.5000, -1.1797, 3.0312, 0.4082, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:26:38,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.25 | optimizer_step: 0.23
[2025-11-06 18:26:38,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.43 | bwd_microstep: 134.37 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 133.36 | step_microstep: 2.33
[2025-11-06 18:26:38,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.12 | bwd: 135.23 | bwd_inner: 1.64 | bwd_allreduce: 133.42 | step: 2.42

49%|████▉ | 1712/3507 [41:52<32:28, 1.09s/it] {'loss': 0.2298, 'learning_rate': 1.0871813041383596e-05, 'epoch': 0.49}
tensor([[-5.3750, -4.1250, -0.4512, 0.8828, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:39,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.70 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.6562, -5.2188, -1.5078, 1.8828, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.0938, -1.9219, 1.7891, 1.5469, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-6.8438, -3.9062, 1.3984, 0.3457, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-8.6875, -7.6875, -3.3906, -0.5703, -5.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.1562, -4.1562, -0.6680, 1.3516, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.3125, -2.5625, -0.1069, 4.0625, -0.2266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.8438, -3.0938, 0.1846, 2.8438, -1.8672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:26:41,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.29
[2025-11-06 18:26:41,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 266.53 | bwd_microstep: 920.44 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 919.30 | step_microstep: 2.15
[2025-11-06 18:26:41,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 471.28 | bwd: 921.31 | bwd_inner: 1.79 | bwd_allreduce: 919.34 | step: 2.24

49%|████▉ | 1713/3507 [41:55<44:25, 1.49s/it] {'loss': 0.7847, 'learning_rate': 1.086261058042467e-05, 'epoch': 0.49}
tensor([[-3.6250, -0.0120, 2.1094, -2.3750, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
tensor([[-6.6250, -5.0625, 0.1953, 2.3594, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.5312, -6.0625, -2.2188, 1.2969, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:41,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.24 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-4.3438, -3.8594, -0.0349, 3.5938, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-0.9961, 0.7695, 3.4844, 3.2500, -0.3477]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.3125, -3.2188, 0.8164, 0.8906, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.3438, -4.9062, -2.2812, 2.8125, -1.6641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.7656, 0.5469, 2.5156, -1.4453, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:26:42,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.22
[2025-11-06 18:26:42,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.97 | bwd_microstep: 1094.27 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 1093.03 | step_microstep: 2.20
[2025-11-06 18:26:42,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.24 | bwd: 1095.37 | bwd_inner: 2.14 | bwd_allreduce: 1093.09 | step: 2.29

49%|████▉ | 1714/3507 [41:56<44:37, 1.49s/it] {'loss': 0.4888, 'learning_rate': 1.0853407383426058e-05, 'epoch': 0.49}
tensor([[-3.1719, -2.1250, 0.6562, 2.0625, -1.7578]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-7.6875, -5.7188, -0.7031, 0.1196, -5.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:43,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.14 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.7188, -3.7969, -1.2656, 2.6875, -1.4297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.9844, -0.5625, 2.6875, -0.5234, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.2656, 1.0703, 3.1719, -0.4199, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.7500, -5.8125, -2.6406, 1.6484, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.9062, -1.6719, 1.4688, 0.8945, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.7500, -1.5938, 1.1875, 2.4062, -1.4297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:26:43,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.19
[2025-11-06 18:26:43,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.22 | bwd_microstep: 444.79 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 443.84 | step_microstep: 1.86
[2025-11-06 18:26:43,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.39 | bwd: 445.51 | bwd_inner: 1.46 | bwd_allreduce: 443.88 | step: 1.94

49%|████▉ | 1715/3507 [41:57<39:11, 1.31s/it] {'loss': 0.613, 'learning_rate': 1.0844203458240574e-05, 'epoch': 0.49}
tensor([[-1.4219, 1.1719, 2.6875, 0.3984, -1.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.7812, -1.2891, 2.2812, -1.0938, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:43,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.98 | bwd_microstep: 1.28 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.07
tensor([[-6.2812, -3.9844, 1.2891, 1.8672, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.8750, -4.2500, -1.0234, 1.7031, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.9219, -0.3047, 3.2188, -0.6641, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-2.1406, 1.6562, 3.6250, -1.5312, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-4.9375, -4.8438, -1.5781, 2.5469, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[ 0.0564, 3.7812, 4.3438, -1.2969, -1.1641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
[2025-11-06 18:26:46,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.26 | optimizer_step: 0.22
[2025-11-06 18:26:46,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.33 | bwd_microstep: 2481.01 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 2479.65 | step_microstep: 2.59
[2025-11-06 18:26:46,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 422.34 | bwd: 2482.30 | bwd_inner: 2.30 | bwd_allreduce: 2479.72 | step: 2.66

49%|████▉ | 1716/3507 [42:00<53:47, 1.80s/it] {'loss': 0.5125, 'learning_rate': 1.0834998812721647e-05, 'epoch': 0.49}
tensor([[-4.4688, -3.7188, -0.5820, 1.7109, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.8750, -5.0938, -1.3125, 1.3516, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:26:46,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.61 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.0312, -1.6875, 2.1250, 1.6562, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.8125, -3.1094, 0.6836, 1.5781, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.6250, -1.5391, 1.8281, -0.8906, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.6094, -0.3164, 2.4844, -1.0312, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-1.2734, 0.5547, 2.1406, 1.4219, -0.8242]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.8125, -3.9062, 0.1348, 2.9844, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:26:47,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.19 | optimizer_step: 0.17
[2025-11-06 18:26:47,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 113.68 | bwd_microstep: 98.16 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 97.30 | step_microstep: 1.96
[2025-11-06 18:26:47,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 284.31 | bwd: 98.91 | bwd_inner: 1.42 | bwd_allreduce: 97.34 | step: 2.03

49%|████▉ | 1717/3507 [42:00<41:24, 1.39s/it] {'loss': 0.6644, 'learning_rate': 1.0825793454723325e-05, 'epoch': 0.49}
tensor([[-4.2188, -0.2793, 2.7969, -1.7109, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:26:47,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.72 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[6.7500, 8.6875, 8.1875, 5.7500, 5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.3906, -0.4922, 2.1094, -0.6133, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.3438,
-3.8125, -1.0078, 1.6562, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9531, -2.8594, 0.4414, 2.0781, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -2.8281, 0.9922, 0.8203, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0938, -5.3438, -1.1484, 2.3594, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5625, -4.4062, 0.2715, 0.8438, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:26:49,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.21 | optimizer_step: 0.31 [2025-11-06 18:26:49,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.72 | bwd_microstep: 1930.14 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1929.06 | step_microstep: 2.23 [2025-11-06 18:26:49,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 273.46 | bwd: 1931.18 | bwd_inner: 1.93 | bwd_allreduce: 1929.11 | step: 2.31 49%|████▉ | 1718/3507 [42:03<48:58, 1.64s/it] {'loss': 0.4104, 'learning_rate': 1.0816587392100264e-05, 'epoch': 0.49} 49%|████▉ | 1718/3507 [42:03<48:58, 1.64s/it]tensor([[-5.9062, -3.0312, 1.3047, -0.0107, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3438, -1.1094, 2.0781, -1.3828, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:26:49,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.43 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.6250, -3.6875, 0.2559, 2.7500, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0625, -3.4219, 1.4062, 0.7148, -4.4688]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0625, -1.4844, 2.4375, -1.3438, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7656, -0.4023, 2.4688, -0.8086, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9375, -0.8672, 2.5469, -0.3125, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1875, -2.5156, 1.0469, -0.3887, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:26:49,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:26:49,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.96 | bwd_microstep: 153.38 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 152.35 | step_microstep: 1.51 [2025-11-06 18:26:49,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 406.43 | bwd: 154.32 | bwd_inner: 1.77 | bwd_allreduce: 152.39 | step: 1.59 49%|████▉ | 1719/3507 [42:03<39:36, 1.33s/it] {'loss': 0.3257, 'learning_rate': 1.080738063270772e-05, 'epoch': 0.49} 49%|████▉ | 1719/3507 [42:03<39:36, 1.33s/it]tensor([[-0.9062, 2.8281, 3.7344, -1.7500, -1.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3438, -4.2812, -0.4922, 1.8828, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:50,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.01 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-2.5000, -0.0166, 1.8672, -0.6875, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7188, -3.6875, -0.7344, 3.0156, -1.5000]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8906, 1.4922, 3.3438, -0.7305, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.8594, 1.3281, 4.1562, -1.6094, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0000, -2.4844, 1.2578, 2.3906, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.2422, 1.8281, 2.3281, -1.8281, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') [2025-11-06 18:26:52,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.19 | optimizer_step: 0.22 [2025-11-06 18:26:52,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.60 | bwd_microstep: 2240.49 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 2239.58 | step_microstep: 2.28 [2025-11-06 18:26:52,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.59 | bwd: 2241.46 | bwd_inner: 1.71 | bwd_allreduce: 2239.62 | step: 2.35 49%|████▉ | 1720/3507 [42:06<54:12, 1.82s/it] {'loss': 0.2805, 'learning_rate': 1.0798173184401548e-05, 'epoch': 0.49} 49%|████▉ | 1720/3507 [42:06<54:12, 1.82s/it]tensor([[-4.2812, -0.7930, 3.3281, 0.0776, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5312, -3.4688, -0.3438, 1.4062, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.9766, 1.6875, 2.1094, -0.9258, -1.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:26:53,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.17 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.6562, -2.7031, 0.5039, 0.0583, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:1') tensor([[-2.8281, -3.3281, -2.0312, 1.5703, -0.8242]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3750, -3.7969, -0.1182, 0.9062, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -1.6953, 1.9141, -1.4062, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3281, -0.0879, 3.1562, 0.0292, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:26:53,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:26:53,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 274.94 | bwd_microstep: 1.74 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.72 | step_microstep: 1.73 [2025-11-06 18:26:53,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 501.14 | bwd: 2.62 | bwd_inner: 1.74 | bwd_allreduce: 0.75 | step: 1.82 49%|████▉ | 1721/3507 [42:07<42:49, 1.44s/it] {'loss': 0.2437, 'learning_rate': 1.0788965055038179e-05, 'epoch': 0.49} 49%|████▉ | 1721/3507 [42:07<42:49, 1.44s/it]tensor([[-4.7188, -2.1094, 1.8828, 0.8672, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:53,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.90 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.8438, -2.8125, 2.2031, 0.6602, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0312, -3.6562, -0.9102, 2.0156, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -1.6172, 2.2812, 1.0000, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.3906, 
0.9258, 2.7812, -1.4297, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2500, -3.1719, 1.1016, 1.4844, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2188, -5.1875, -2.0312, 1.8750, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3438, -0.2734, 3.6406, -1.0078, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:26:54,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:26:54,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.21 | bwd_microstep: 1021.00 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1019.84 | step_microstep: 2.72 [2025-11-06 18:26:54,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.14 | bwd: 1021.95 | bwd_inner: 1.94 | bwd_allreduce: 1019.88 | step: 2.79 49%|████▉ | 1722/3507 [42:08<42:19, 1.42s/it] {'loss': 0.5228, 'learning_rate': 1.077975625247464e-05, 'epoch': 0.49} 49%|████▉ | 1722/3507 [42:08<42:19, 1.42s/it]tensor([[-5.6250, -3.3125, 0.7773, 0.6055, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7500, -3.8438, 0.2109, 0.8672, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6875, -3.2500, 0.0938, 1.0547, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:26:54,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.69 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.4219, -3.7500, -1.7422, 2.1250, -1.2266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.5312, -1.1172, 1.6797, 4.8438, 0.1641]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2812, -3.6562, -0.1689, 2.8750, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8594, -2.0938, 1.0234, 0.9922, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6719, -1.3828, 1.6641, 0.9883, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:26:55,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:26:55,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.43 | bwd_microstep: 282.80 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 281.83 | step_microstep: 1.59 [2025-11-06 18:26:55,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.15 | bwd: 283.55 | bwd_inner: 1.55 | bwd_allreduce: 281.87 | step: 1.67 49%|████▉ | 1723/3507 [42:09<37:14, 1.25s/it] {'loss': 0.4851, 'learning_rate': 1.0770546784568523e-05, 'epoch': 0.49} 49%|████▉ | 1723/3507 [42:09<37:14, 1.25s/it]tensor([[-2.5469, 0.8789, 2.5781, -1.8906, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.8281, -2.7344, 0.8047, 3.0938, -1.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -3.0000, 0.3750, 0.7656, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:26:55,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.42 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.5938, -4.4688, -0.1582, 2.0469, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -3.5938, 0.0557, 1.5781, -3.1094]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9062, -2.9531, 1.8281, 0.3418, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3125, -4.6250, -1.1016, 1.6875, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2188, -4.0625, -0.6758, 3.2188, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:26:58,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.24 | optimizer_step: 0.33 [2025-11-06 18:26:58,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.23 | bwd_microstep: 2033.32 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 2032.20 | step_microstep: 2.55 [2025-11-06 18:26:58,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 441.68 | bwd: 2034.24 | bwd_inner: 1.84 | bwd_allreduce: 2032.25 | step: 2.63 49%|████▉ | 1724/3507 [42:11<48:30, 1.63s/it] {'loss': 0.4444, 'learning_rate': 1.0761336659177992e-05, 'epoch': 0.49} 49%|████▉ | 1724/3507 [42:11<48:30, 1.63s/it]tensor([[-5.4062, -3.6250, 0.3477, 1.0234, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:26:58,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 137.48 | bwd_microstep: 1.40 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.5000, -3.1094, -0.3848, 2.4375, -1.6016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0312, -2.7344, 2.6250, 0.6406, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5625, -4.5312, 0.6133, 1.4766, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5469, -1.9219, 0.6680, 2.8906, -1.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:0') tensor([[-4.1250, -1.7109, 1.5234, 0.6055, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3281, 2.0469, 3.5156, -2.7188, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-1.8516, 1.4141, 3.9219, 0.1279, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:26:58,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.23 | optimizer_step: 0.22 [2025-11-06 18:26:58,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.25 | bwd_microstep: 212.64 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 211.43 | step_microstep: 2.52 [2025-11-06 18:26:58,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.72 | bwd: 214.03 | bwd_inner: 2.32 | bwd_allreduce: 211.48 | step: 2.62 49%|████▉ | 1725/3507 [42:12<39:10, 1.32s/it] {'loss': 0.7173, 'learning_rate': 1.0752125884161766e-05, 'epoch': 0.49} 49%|████▉ | 1725/3507 [42:12<39:10, 1.32s/it]tensor([[-3.7031, -3.6562, -1.4688, 1.7969, -1.7266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -4.0625, -0.7891, 0.9492, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7344, -1.5625, 0.8008, -0.0271, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:26:58,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.00 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.4531, 0.8867, 2.1250, -2.2344, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.1562, -3.4375, -0.1924, 2.3594, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.9375, 
-1.3906, 0.8203, 3.0000, -0.5352]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7500, -4.5938, -0.5586, 1.5547, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9688, -2.8594, 0.3184, 2.0625, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:27:01,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.24 [2025-11-06 18:27:01,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.18 | bwd_microstep: 2229.11 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 2227.86 | step_microstep: 2.55 [2025-11-06 18:27:01,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.19 | bwd: 2229.97 | bwd_inner: 1.93 | bwd_allreduce: 2227.91 | step: 2.63 49%|████▉ | 1726/3507 [42:15<50:52, 1.71s/it] {'loss': 0.7134, 'learning_rate': 1.0742914467379126e-05, 'epoch': 0.49} 49%|████▉ | 1726/3507 [42:15<50:52, 1.71s/it]tensor([[-4.9375, -2.2969, 2.0312, 1.1016, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:27:01,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.11 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-0.5898, 0.4199, 3.1250, 4.5625, 0.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2188, -4.3750, -1.2891, 3.2500, -1.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6094, -0.4492, 3.3594, 0.8047, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3750, -4.2812, -0.2070, 1.9297, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8594, -3.5781, -0.2324, 3.4375, -1.5781]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.2188, -2.8438, 1.1562, 0.6328, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1875, -6.2500, -3.2500, 0.8555, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:27:01,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:27:01,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.83 | bwd_microstep: 53.99 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 52.99 | step_microstep: 1.79 [2025-11-06 18:27:01,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.95 | bwd: 54.86 | bwd_inner: 1.69 | bwd_allreduce: 53.03 | step: 1.87 49%|████▉ | 1727/3507 [42:15<39:24, 1.33s/it] {'loss': 0.3509, 'learning_rate': 1.0733702416689895e-05, 'epoch': 0.49} 49%|████▉ | 1727/3507 [42:15<39:24, 1.33s/it]tensor([[-6.0938, -4.9062, -0.8945, 1.4453, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7188, -3.7188, -0.1982, -0.4297, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:27:02,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.58 | bwd_microstep: 1.29 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.3750, -4.3125, -1.4609, 2.3750, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1562, -2.4062, -0.3066, -0.9492, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5312, -1.4141, 1.5547, 0.7891, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -4.0312, -0.8477, 0.9961, -3.0938]], device='cuda:2', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5938, -3.8594, 1.1250, 0.4121, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0000, -3.0312, 1.6719, 2.4531, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:27:02,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.23 | optimizer_step: 0.21 [2025-11-06 18:27:02,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.54 | bwd_microstep: 243.40 | bwd_inner_microstep: 1.29 | bwd_allreduce_microstep: 241.99 | step_microstep: 2.10 [2025-11-06 18:27:02,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 380.14 | bwd: 244.69 | bwd_inner: 2.49 | bwd_allreduce: 242.04 | step: 2.19 49%|████▉ | 1728/3507 [42:16<33:28, 1.13s/it] {'loss': 0.5505, 'learning_rate': 1.0724489739954447e-05, 'epoch': 0.49} 49%|████▉ | 1728/3507 [42:16<33:28, 1.13s/it]tensor([[-4.3750, -1.5000, 1.6953, -0.2969, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.9688, -6.8438, -3.2344, 0.9102, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -3.3906, -0.1035, 1.8750, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7500, -5.8750, -2.7500, 1.6641, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7812, -0.6094, 2.3281, 1.7891, -1.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:27:03,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.92 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.7812, -1.8047, 1.7969, -0.2451, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') 
tensor([[-2.7656, -2.0312, 1.1797, 4.0000, -0.8828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1250, -1.8906, -0.5586, -4.2188, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:27:03,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:27:03,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.17 | bwd_microstep: 1.97 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.97 [2025-11-06 18:27:03,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.06 | bwd: 2.76 | bwd_inner: 1.76 | bwd_allreduce: 0.86 | step: 2.04 49%|████▉ | 1729/3507 [42:17<35:29, 1.20s/it] {'loss': 0.3356, 'learning_rate': 1.0715276445033667e-05, 'epoch': 0.49} 49%|████▉ | 1729/3507 [42:17<35:29, 1.20s/it]tensor([[-3.2656, -1.4141, 0.8125, 0.5156, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8125, 1.6562, 4.1250, -1.9375, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6875, -2.5312, 1.2969, 1.2969, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:27:04,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.63 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.9531, -2.9375, 0.5938, 2.9375, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.5547, 1.8672, 2.8281, -1.9453, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.1992, 3.0469, 3.1562, -1.2969, -1.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4062, 0.6328, 2.8750, -2.4219, 
-3.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0625, 0.1299, 3.1250, 0.0303, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:27:04,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:27:04,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 148.55 | bwd_microstep: 31.38 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 30.14 | step_microstep: 1.47 [2025-11-06 18:27:04,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.20 | bwd: 32.31 | bwd_inner: 1.97 | bwd_allreduce: 30.19 | step: 1.57 49%|████▉ | 1730/3507 [42:18<28:24, 1.04it/s] {'loss': 0.7692, 'learning_rate': 1.0706062539788995e-05, 'epoch': 0.49} 49%|████▉ | 1730/3507 [42:18<28:24, 1.04it/s]tensor([[-3.9062, -0.5742, 2.1719, -1.5547, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.0000, -4.0938, 0.6953, 1.6875, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7188, -1.6406, 2.7500, 0.7773, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5312, -6.1875, -2.4062, 1.4922, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:27:05,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.28 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.8438, -2.6875, 0.9258, 2.8750, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5938, -4.1875, -0.0991, 1.4922, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.7812, -5.4375, -0.6289, 1.7500, -4.2812]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7812, -3.9531, -0.5195, 2.0312, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:27:07,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:27:07,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.31 | bwd_microstep: 547.84 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 546.62 | step_microstep: 2.07
[2025-11-06 18:27:07,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.61 | bwd: 548.72 | bwd_inner: 1.91 | bwd_allreduce: 546.66 | step: 2.15
49%|████▉ | 1731/3507 [42:21<47:20, 1.60s/it] {'loss': 0.7251, 'learning_rate': 1.0696848032082376e-05, 'epoch': 0.49}
tensor([[-3.6562, -2.6562, 0.0972, 1.7812, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5000, -4.0625, -1.0938, 1.8281, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7500, -1.9766, 1.6953, -0.0613, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:07,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.84 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.8750, -3.8281, -0.2334, -0.5078, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5312, -4.7500, -2.0000, 2.3750, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.2500, 0.6328, 2.8594, -2.2188, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8125, -5.2812, -2.9375, 1.6406, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0938, -2.8906, 0.7656, 0.3027, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:07,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:27:07,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.93 | bwd_microstep: 2.15 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 1.91
[2025-11-06 18:27:07,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.80 | bwd: 2.89 | bwd_inner: 1.98 | bwd_allreduce: 0.78 | step: 1.98
49%|████▉ | 1732/3507 [42:21<36:44, 1.24s/it] {'loss': 0.2165, 'learning_rate': 1.0687632929776272e-05, 'epoch': 0.49}
tensor([[-3.6406, -3.6562, -0.7305, 3.2969, -1.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.8750, -3.6406, 1.2109, 1.5781, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.3438, -2.8438, 1.4062, -1.6094, -5.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1875, -3.7969, 0.2256, 2.1406, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5000, -0.2061, 2.3594, -1.1797, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.3125, -3.3594, 1.8203, 0.7227, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:08,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 285.52 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.4062, -1.4922, 2.0625, 0.0723, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-7.8125, -5.4375, 0.1846, 0.5898, -5.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:27:10,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.25 | optimizer_step: 0.22
[2025-11-06 18:27:10,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.76 | bwd_microstep: 992.22 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 991.17 | step_microstep: 2.39
[2025-11-06 18:27:10,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 532.23 | bwd: 993.10 | bwd_inner: 1.73 | bwd_allreduce: 991.23 | step: 2.47
49%|████▉ | 1733/3507 [42:23<46:26, 1.57s/it] {'loss': 0.5659, 'learning_rate': 1.0678417240733654e-05, 'epoch': 0.49}
tensor([[-4.4688, -3.0312, 0.4766, 1.6719, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:10,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.14 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.9844, -4.0000, -0.9609, 2.9688, -1.6797]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9062, -4.0625, -0.4766, 2.1406, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7344, -0.8164, 2.2344, 0.1025, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7812, -3.5000, -1.0000, 2.0312, -1.7578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.4062, -4.3438, 0.3340, 1.0078, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8281, -4.1250, -1.4141, 3.1406, -1.3672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9062, -1.8047, 2.1562, -0.0903, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:27:10,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.22 | optimizer_step: 0.32
[2025-11-06 18:27:10,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.71 | bwd_microstep: 211.45 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 210.18 | step_microstep: 2.52
[2025-11-06 18:27:10,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.87 | bwd: 212.47 | bwd_inner: 2.05 | bwd_allreduce: 210.24 | step: 2.62
49%|████▉ | 1734/3507 [42:24<37:20, 1.26s/it] {'loss': 0.2979, 'learning_rate': 1.066920097281799e-05, 'epoch': 0.49}
tensor([[-4.8125, -3.4219, 0.5898, 2.2500, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4062, -4.4688, -0.5312, 2.2188, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.1562, -3.3281, 1.6875, 0.8477, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3438, -2.2969, 0.5312, 0.0928, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[3.9844, 4.8438, 6.3125, 7.5312, 4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.0938, -2.8906, -0.0757, 3.0938, -1.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:27:11,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.89 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.7188, -4.4062, -0.1836, 1.7969, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.3750, -0.9180, 2.6250, -0.4902, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:27:13,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:27:13,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.94 | bwd_microstep: 1255.86 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 1254.88 | step_microstep: 2.04
[2025-11-06 18:27:13,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.84 | bwd: 1256.81 | bwd_inner: 1.71 | bwd_allreduce: 1254.93 | step: 2.14
49%|████▉ | 1735/3507 [42:27<50:34, 1.71s/it] {'loss': 0.3755, 'learning_rate': 1.0659984133893245e-05, 'epoch': 0.49}
tensor([[-2.9062, 0.1816, 2.3281, -1.0078, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.0312, -0.0244, 2.7812, 0.4004, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.8125, -4.9062, 0.3398, 1.6016, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:13,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.31 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.9062, -3.4219, 0.1270, 1.3594, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7500, -3.7344, -2.4375, 2.2969, -0.4883]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2188, -3.0938, 0.5195, 2.5000, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.1719, -1.8672, 1.4609, 2.9219, -1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.7500, -4.0625, -1.5938, 2.4219, -1.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
[2025-11-06 18:27:13,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:27:13,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.91 | bwd_microstep: 79.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 79.04 | step_microstep: 1.80
[2025-11-06 18:27:13,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.25 | bwd: 80.62 | bwd_inner: 1.40 | bwd_allreduce: 79.08 | step: 1.88
50%|████▉ | 1736/3507 [42:27<39:24, 1.34s/it] {'loss': 1.2607, 'learning_rate': 1.0650766731823875e-05, 'epoch': 0.5}
tensor([[-2.7969, -3.6875, -2.5312, 1.9297, -0.6133]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.8750, -3.2500, 1.6562, 0.9727, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1562, -5.0312, -1.4219, 2.6719, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3750, -4.3750, 0.1387, 3.1094, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.2031, -2.1406, 0.9961, 5.0312, -0.1167]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -3.1406, 0.0273, 1.1562, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:27:15,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.29 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.3438, -4.3438, -1.3438, 2.6094, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.1719, 1.4844, 2.6562, -2.3281, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:27:16,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:27:16,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.21 | bwd_microstep: 801.32 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 800.06 | step_microstep: 17.46
[2025-11-06 18:27:16,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.53 | bwd: 802.31 | bwd_inner: 2.08 | bwd_allreduce: 800.10 | step: 17.54
50%|████▉ | 1737/3507 [42:30<50:09, 1.70s/it] {'loss': 0.1543, 'learning_rate': 1.0641548774474807e-05, 'epoch': 0.5}
tensor([[-5.1562, -3.3125, 0.7852, 1.6484, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8125, -3.1250, 1.0312, 2.0312, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8125, -2.8594, 1.6250, 2.3594, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.9219, 0.0188, 3.7188, -1.0625, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:16,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.38 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-6.8438, -3.4688, 1.4609, -0.6797, -5.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8438, -0.7188, 3.1562, -1.8125, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.8906, -1.9688, 2.1250, 2.7969, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.2656, 0.7852, 3.8906, -0.9844, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:27:16,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:27:16,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.40 | bwd_microstep: 11.19 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 10.13 | step_microstep: 1.68
[2025-11-06 18:27:16,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 481.80 | bwd: 12.06 | bwd_inner: 1.77 | bwd_allreduce: 10.16 | step: 1.75
50%|████▉ | 1738/3507 [42:30<39:49, 1.35s/it] {'loss': 0.6102, 'learning_rate': 1.0632330269711449e-05, 'epoch': 0.5}
tensor([[-2.8594, -3.5781, -2.1250, 2.1719, -0.6836]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-3.5156, -2.0156, 1.8750, 3.1406, -1.9141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.3594, -3.1250, -2.4844, 1.2578, -0.4473]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:27:17,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.43 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-3.0156, -3.6875, -2.2969, 1.9297, -0.8398]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0000, -3.1562, 1.1953, 2.0469, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.8438, -2.7969, 1.0859, 1.0547, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.3750, -4.4688, -1.4844, 2.6250, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.1484, 2.6719, 3.6719, -1.5781, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:27:19,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.19 | optimizer_step: 0.21
[2025-11-06 18:27:19,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.41 | bwd_microstep: 2457.49 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 2456.42 | step_microstep: 2.14
[2025-11-06 18:27:19,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.87 | bwd: 2458.34 | bwd_inner: 1.71 | bwd_allreduce: 2456.46 | step: 2.22
50%|████▉ | 1739/3507 [42:33<52:35, 1.78s/it] {'loss': 1.092, 'learning_rate': 1.0623111225399674e-05, 'epoch': 0.5}
tensor([[-5.0625, -3.7969, 0.3301, 2.4375, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7812, -4.8750, -1.5469, 2.9688, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.8125, -4.3125, 0.1108, 1.5781, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3438, -3.4531, 0.6992, 1.2656, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:19,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 209.70 | bwd_microstep: 1.25 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.4141, 1.4688, 2.5156, -0.7773, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5312, -3.8125, -0.7383, 1.7578, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.5625, -3.1094, 1.3906, 1.0859, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3438, -4.0312, -0.1709, 3.7344, -1.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:27:20,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.19 | optimizer_step: 0.17
[2025-11-06 18:27:20,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.01 | bwd_microstep: 39.66 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 38.54 | step_microstep: 1.90
[2025-11-06 18:27:20,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.74 | bwd: 40.91 | bwd_inner: 2.18 | bwd_allreduce: 38.58 | step: 1.99
50%|████▉ | 1740/3507 [42:33<40:39, 1.38s/it] {'loss': 0.3395, 'learning_rate': 1.0613891649405816e-05, 'epoch': 0.5}
tensor([[-7.5938, -7.5938, -4.1875, 0.2227, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.2969, 2.0469, 2.6562, -1.9844, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.1875, -0.0098, 2.2188, -0.9336, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.7188, -5.3125, -1.6719, 2.0312, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:27:20,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.52 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12
tensor([[-4.5312, -3.9375, -0.6211, 2.4375, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7500, -3.6719, -0.4648, 3.4844, -1.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7656, -1.4531, 2.1719, 1.5938, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-3.6562, -0.7422, 2.4844, 0.1235, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:27:21,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.18 | optimizer_step: 0.22
[2025-11-06 18:27:21,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.00 | bwd_microstep: 1191.87 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 1190.99 | step_microstep: 1.98
[2025-11-06 18:27:21,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 390.55 | bwd: 1192.81 | bwd_inner: 1.59 | bwd_allreduce: 1191.05 | step: 2.11
50%|████▉ | 1741/3507 [42:35<42:52, 1.46s/it] {'loss': 0.6211, 'learning_rate': 1.0604671549596661e-05, 'epoch': 0.5}
tensor([[-3.3281, -2.2188, 0.8945, 2.7812, -1.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5469, -3.5312, -0.5273, 3.4219, -1.2891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7812, -3.2188, -0.5703, 1.9844, -1.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:27:21,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.03 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-3.3594, -0.7305, 1.8125, -0.3848, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.3438, -5.3125, -1.0938, 1.3672, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.5156, 1.6250, 3.2188, -2.7969, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.8125, -5.1250, -1.1797, 2.1406, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4062, -3.5469, -0.4785, 1.7734, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:27:22,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.16 | optimizer_step: 0.20
[2025-11-06 18:27:22,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 77.25 | bwd_microstep: 357.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 357.00 | step_microstep: 2.22
[2025-11-06 18:27:22,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 279.27 | bwd: 358.88 | bwd_inner: 1.66 | bwd_allreduce: 357.06 | step: 2.33
50%|████▉ | 1742/3507 [42:36<35:55, 1.22s/it] {'loss': 0.1144, 'learning_rate': 1.0595450933839444e-05, 'epoch': 0.5}
tensor([[-2.5000, 1.2344, 3.1562, -1.9141, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.1562, 1.3438, 3.0000, 0.9375, -1.0859]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7188, -3.5625, 0.8906, 1.0781, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.7500, -2.9844, 1.7656, 0.7695, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.3438, -4.6875, -0.4609, 0.9062, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1562, -3.4844, 0.5000, 1.6016, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8594, -2.6406, 0.8789, 2.7188, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:27:24,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.15 | bwd_microstep: 1.12 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.6875, -3.4844, 1.3516, -0.8477, -5.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:24,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.80 | optimizer_gradients: 0.25 | optimizer_step: 0.26
[2025-11-06 18:27:24,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 324.88 | bwd_microstep: 3.81 | bwd_inner_microstep: 2.32 | bwd_allreduce_microstep: 1.28 | step_microstep: 3.31
[2025-11-06 18:27:24,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 489.06 | bwd: 4.89 | bwd_inner: 3.32 | bwd_allreduce: 1.31 | step: 3.41
50%|████▉ | 1743/3507 [42:38<44:01, 1.50s/it] {'loss': 0.692, 'learning_rate': 1.058622981000184e-05, 'epoch': 0.5}
tensor([[-4.4062, -1.5312, 1.5781, -0.7109, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6719, -0.6211, 3.2656, 1.0625, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.5000, -4.3125, -0.2119, 2.2188, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:27:24,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.61 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.6250, 0.4805, 3.0312, -2.3281, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.5938, -3.7500, 0.4512, 1.2109, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.1875, 0.8086, 1.3203, -0.4180, -1.1797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-6.3750, -3.5312, -0.6953, -3.0625, -5.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.0625, -1.5234, 1.9609, 1.1016, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:27:25,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.62 | optimizer_step: 0.55
[2025-11-06 18:27:25,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.40 | bwd_microstep: 37.58 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 36.16 | step_microstep: 5.33
[2025-11-06 18:27:25,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 384.05 | bwd: 38.47 | bwd_inner: 1.97 | bwd_allreduce: 36.26 | step: 5.38
50%|████▉ | 1744/3507 [42:38<34:56, 1.19s/it] {'loss': 0.3174, 'learning_rate': 1.057700818595195e-05, 'epoch': 0.5}
tensor([[-4.2812, -2.7656, 0.6016, 1.4062, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7500, -4.3750, -1.1719, 2.2812, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1875, -2.1406, 2.4531, 0.6992, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5625, -5.3438, -1.8672, 1.9297, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7500, -2.8125, 1.1797, 1.2969, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.0625, -3.7812, -0.1787, -0.7031, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9688, -2.0625, 2.2656, 0.7539, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:26,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.55 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.2188, -0.6133, 2.6719, -1.0391, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:27,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.46 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:27:27,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.05 | bwd_microstep: 1.87 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.82 | step_microstep: 4.23
[2025-11-06 18:27:27,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 474.62 | bwd: 2.71 | bwd_inner: 1.70 | bwd_allreduce: 0.87 | step: 4.32
50%|████▉ | 1745/3507 [42:41<44:08, 1.50s/it] {'loss': 0.3368, 'learning_rate': 1.0567786069558321e-05, 'epoch': 0.5}
tensor([[-4.7812, -4.7500, -1.7109, 2.1719, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:27:27,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.82 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.9062, -4.8750, -3.3438, 1.7656, -1.3047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4062, -4.3125, -0.6250, 1.6797, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5000, 0.2197, 4.1562, 0.5234, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0000, -4.7188, -1.5156, 2.1406, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.2188, 1.8438, 3.2656, -2.7500, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.2500, -5.0938, -0.4062, 2.2969, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3750, -1.7969, 1.5469, -0.0422, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:27:28,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.86 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:27:28,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.96 | bwd_microstep: 484.23 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 482.96 | step_microstep: 4.35
[2025-11-06 18:27:28,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 279.80 | bwd: 485.10 | bwd_inner: 1.94 | bwd_allreduce: 483.00 | step: 4.43
50%|████▉ | 1746/3507 [42:41<38:02, 1.30s/it] {'loss': 0.0985, 'learning_rate': 1.0558563468689902e-05, 'epoch': 0.5}
tensor([[-2.1719, -2.8125, -1.1875, 3.2656, -0.0859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8750, -4.6250, -2.9844, 1.5000, -1.4922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1250, -1.6328, 2.9219, -0.0967, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.0625, -0.4453, 2.0000, 0.1719, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6250, -1.2266, 1.3594, 0.0135, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5469, -2.3438, 0.7188, 1.8984, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:27:30,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.09 | bwd_microstep: 3.49 | bwd_inner_microstep: 3.36 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.0312, -4.3750, 0.1030, 1.6250, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.7188, -4.1875, 0.1357, 1.8984, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:27:30,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 2.21 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:27:30,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 210.96 | bwd_microstep: 1.64 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.78 | step_microstep: 5.30
[2025-11-06 18:27:30,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.07 | bwd: 5.12 | bwd_inner: 4.15 | bwd_allreduce: 0.82 | step: 5.38
50%|████▉ | 1747/3507 [42:44<48:52, 1.67s/it] {'loss': 0.5689, 'learning_rate': 1.0549340391216058e-05, 'epoch': 0.5}
tensor([[-1.5469, 2.6719, 3.4062, -3.1406, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-4.5000, -3.4375, 0.5039, 3.1094, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.7812, -4.9062, -0.8984, 2.0312, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5625, -3.1094, 1.4141, 3.2188, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:30,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.08 | bwd_microstep: 0.63 | bwd_inner_microstep: 0.52 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.3750, -4.5000, -0.6289, 2.0312, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3438, -5.3438, -1.6484, 2.8750, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3750, -1.7344, 1.5156, 0.0762, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.0781, -2.1719, -0.7266, 2.3125, -0.4258]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
[2025-11-06 18:27:31,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.30 | optimizer_gradients: 0.28 | optimizer_step: 0.22
[2025-11-06 18:27:31,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.10 | bwd_microstep: 34.38 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 33.03 | step_microstep: 4.00
[2025-11-06 18:27:31,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.19 | bwd: 35.01 | bwd_inner: 1.76 | bwd_allreduce: 33.07 | step: 4.08
50%|████▉ | 1748/3507 [42:44<38:08, 1.30s/it] {'loss': 0.9832, 'learning_rate': 1.0540116845006568e-05, 'epoch': 0.5}
tensor([[-0.2480, 2.2500, 2.4375, -0.6680, -0.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.2344, -2.9219, -0.5273, 2.4062, -1.3672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.6562, -2.4062, 2.4062, 0.1201, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1562, -3.4844, 0.2559, 3.4844, -1.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.1406, -0.6055, 1.7266, -0.3867, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.2969, -3.8906, -1.8047, 2.7500, -0.9648]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:27:32,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.76 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.5938, -1.6406, 3.1250, -0.6836, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.7656, 0.4902, 2.1250, -1.3984, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:27:34,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.82 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:27:34,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.84 | bwd_microstep: 1692.43 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 1691.28 | step_microstep: 2.68
[2025-11-06 18:27:34,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 408.60 | bwd: 1693.39 | bwd_inner: 1.90 | bwd_allreduce: 1691.33 | step: 2.77
50%|████▉ | 1749/3507 [42:47<52:33, 1.79s/it] {'loss': 0.4641, 'learning_rate': 1.0530892837931603e-05, 'epoch': 0.5}
tensor([[-6.0312, -3.8594, 0.9883, 1.4375, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:34,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 104.80 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.1562, -4.1562, -1.2656, 2.6719, -1.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.4219, -3.1719, -1.9141, 2.3438, -0.3066]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.3125, -1.0781, 2.5781, -0.1523, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.2812, -3.8750, 0.8320, 2.9219, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0938, -2.2031, 1.0156, -0.9453, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.4062, 0.9688, 3.9531, -2.1406, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.7188, -4.1250, -0.1641, -1.1797, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:27:34,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.27 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:27:34,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.62 | bwd_microstep: 316.07 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 314.72 | step_microstep: 3.74
[2025-11-06 18:27:34,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 216.43 | bwd: 317.14 | bwd_inner: 2.26 | bwd_allreduce: 314.75 | step: 3.81
50%|████▉ | 1750/3507 [42:48<41:44, 1.43s/it] {'loss': 0.8304, 'learning_rate': 1.0521668377861734e-05, 'epoch': 0.5}
tensor([[-2.8906, 1.2656, 3.2031, -2.6094, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.4062, -3.4844, 0.1514, 2.6406, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5625, -5.1875, -1.2812, 2.7812, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.5938, 0.5195, 1.4375, -0.1533, -1.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-0.4023, 2.6094, 3.2344, -0.4941, -1.0078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:34,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.54 | bwd_microstep: 8.46 | bwd_inner_microstep: 8.33 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.1094, -0.5898, 1.6875, 0.0747, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7969, -3.8281, -1.2969, 2.2969, -1.6484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.3125, -4.0000, 0.0723, -0.0119, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:27:35,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.13 | optimizer_gradients: 0.17 | optimizer_step: 0.21
[2025-11-06 18:27:35,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.85 | bwd_microstep: 273.34 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 272.25 | step_microstep: 3.38
[2025-11-06 18:27:35,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.40 | bwd: 281.79 | bwd_inner: 9.32 | bwd_allreduce: 272.30 | step: 3.47
50%|████▉ | 1751/3507 [42:49<36:04, 1.23s/it] {'loss': 0.407, 'learning_rate': 1.0512443472667917e-05, 'epoch': 0.5}
tensor([[-4.4375, -3.6250, -0.2812, 2.2969, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9062, -2.5156, 1.4375, 0.7148, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.5469, 0.6055, 3.6406, -1.8906, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6875, -2.7344, 0.8906, 1.1562, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:27:35,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.84 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-6.3750, -3.6562, 0.5859, -0.4609, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2031, -0.6133, 2.1406, 0.3535, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.5938, -3.7031, 0.4082, 1.0312, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.7500, -3.7656, 1.4297, 0.1201, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:27:35,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.24 | optimizer_step: 0.21
[2025-11-06 18:27:35,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 112.20 | bwd_microstep: 189.22 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 188.24 | step_microstep: 3.68
[2025-11-06 18:27:35,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 288.09 | bwd: 190.08 | bwd_inner: 1.66 | bwd_allreduce: 188.28 | step: 3.75
50%|████▉ | 1752/3507 [42:49<29:48,
1.02s/it] {'loss': 0.3282, 'learning_rate': 1.050321813022148e-05, 'epoch': 0.5} 50%|████▉ | 1752/3507 [42:49<29:48, 1.02s/it]tensor([[-4.4375, -0.5703, 2.4844, -2.0781, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:27:36,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.99 | bwd_microstep: 1.32 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12 tensor([[-9.6875, -6.0312, -1.9219, -4.8750, -8.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0938, -2.0000, 1.5078, 1.5391, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0312, -2.8750, 0.7227, 2.8281, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2188, -5.0000, -1.7188, 2.0312, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1562, -2.2969, 1.5000, 0.0762, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.1094, 0.3652, 2.4688, -1.6562, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-6.2812, -4.7812, 0.1738, 2.2500, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:27:36,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.42 | optimizer_step: 0.50 [2025-11-06 18:27:37,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 113.92 | bwd_microstep: 629.62 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 628.30 | step_microstep: 4.24 [2025-11-06 18:27:37,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 258.92 | bwd: 630.96 | bwd_inner: 2.31 | bwd_allreduce: 628.41 | step: 4.34 50%|████▉ | 1753/3507 [42:50<30:41, 1.05s/it] {'loss': 0.451, 
'learning_rate': 1.0493992358394136e-05, 'epoch': 0.5} 50%|████▉ | 1753/3507 [42:50<30:41, 1.05s/it]tensor([[-2.3281, -2.9688, -2.2500, 1.2891, -0.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:27:37,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.18 | bwd_microstep: 1.85 | bwd_inner_microstep: 1.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.9375, -3.7031, 0.6406, -3.8438, -7.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-2.6250, -1.5156, 1.6406, 3.3906, -1.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -2.0938, 1.1328, 1.3203, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8125, -0.2812, 1.2422, -1.1406, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -4.1250, -0.0065, 2.4062, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4062, -1.8281, 1.5312, 0.2031, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8125, -1.5078, 2.6406, 2.0312, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:27:39,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.99 | optimizer_gradients: 0.24 | optimizer_step: 0.26 [2025-11-06 18:27:39,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.07 | bwd_microstep: 897.09 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 896.22 | step_microstep: 3.30 [2025-11-06 18:27:39,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.24 | bwd: 898.93 | bwd_inner: 2.45 | bwd_allreduce: 896.28 | step: 3.38 50%|█████ | 1754/3507 [42:53<40:50, 1.40s/it] {'loss': 1.0359, 'learning_rate': 1.048476616505796e-05, 
'epoch': 0.5} 50%|█████ | 1754/3507 [42:53<40:50, 1.40s/it]tensor([[-4.8438, -3.5938, -0.1299, 1.7031, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0859, 2.2500, 2.9688, -1.3047, -1.7422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.9375, -3.5156, 0.8945, 2.7344, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:27:39,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.32 | bwd_microstep: 1.27 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.7188, -2.0938, 2.1094, -1.4375, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3438, -3.7969, 0.0267, 1.1172, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7500, -2.9531, 0.2656, 2.5312, -1.9297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5000, -3.6875, 0.6562, 1.7266, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3125, -2.9531, 1.8828, 1.7734, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:27:40,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.21 | optimizer_step: 0.18 [2025-11-06 18:27:40,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.87 | bwd_microstep: 1200.30 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 1199.43 | step_microstep: 2.17 [2025-11-06 18:27:40,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.19 | bwd: 1201.57 | bwd_inner: 1.91 | bwd_allreduce: 1199.49 | step: 2.26 50%|█████ | 1755/3507 [42:54<42:57, 1.47s/it] {'loss': 1.2757, 'learning_rate': 1.047553955808538e-05, 'epoch': 0.5} 50%|█████ | 1755/3507 
[42:54<42:57, 1.47s/it]tensor([[-4.6250, -4.9688, -2.5156, 1.7500, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.7812, -5.1562, -0.7891, 0.7578, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:27:41,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.28 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.6250, -3.8125, -0.0649, 2.9062, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7812, -2.2031, 2.7969, 2.3125, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.7812, -4.6250, -1.1250, -1.3203, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6719, -1.1094, 1.8984, 0.0131, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8438, -1.8828, 2.8281, -0.7852, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2031, -3.0156, -0.5195, 2.9062, -1.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:27:42,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:27:42,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.21 | bwd_microstep: 667.37 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 666.50 | step_microstep: 2.07 [2025-11-06 18:27:42,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.51 | bwd: 668.12 | bwd_inner: 1.41 | bwd_allreduce: 666.55 | step: 2.16 50%|█████ | 1756/3507 [42:56<43:48, 1.50s/it] {'loss': 0.2355, 'learning_rate': 1.046631254534919e-05, 'epoch': 0.5} 50%|█████ | 1756/3507 [42:56<43:48, 
1.50s/it]tensor([[-5.6875, -5.9062, -2.7656, 1.7891, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:27:42,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.27 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.1562, -1.4375, 1.8672, 0.2412, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.0625, -0.1514, 1.9453, -0.7891, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7812, -3.2812, 0.6328, 2.0781, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5625, -3.6562, 0.3281, 3.1875, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2656, -0.1650, 2.5312, -0.8750, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.4844, 2.0312, 2.6719, -2.0312, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.5938, -0.2139, 3.4062, -2.3594, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:27:43,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.21 | optimizer_step: 0.19 [2025-11-06 18:27:43,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.82 | bwd_microstep: 992.16 | bwd_inner_microstep: 5.25 | bwd_allreduce_microstep: 986.83 | step_microstep: 2.38 [2025-11-06 18:27:43,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.11 | bwd: 992.92 | bwd_inner: 5.90 | bwd_allreduce: 986.87 | step: 2.46 50%|█████ | 1757/3507 [42:57<42:52, 1.47s/it] {'loss': 0.4436, 'learning_rate': 1.0457085134722516e-05, 'epoch': 0.5} 50%|█████ | 1757/3507 [42:57<42:52, 1.47s/it]tensor([[-4.3750, -1.3281, 
2.4688, 0.3164, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2344, 0.9766, 3.3438, -2.5000, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:27:43,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.64 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.1250, -2.7344, 1.1641, 0.7227, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.5000, -5.6562, -0.6406, 0.9844, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7812, -3.5469, 0.3770, 2.4531, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2188, 0.0952, 3.9062, -1.4922, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.8594, -3.6562, -2.3125, 2.1875, -0.6484]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-7.4688, -6.0000, -1.6406, 0.2500, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:27:45,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.74 | optimizer_gradients: 0.25 | optimizer_step: 0.20 [2025-11-06 18:27:45,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.60 | bwd_microstep: 1587.40 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1586.20 | step_microstep: 3.19 [2025-11-06 18:27:45,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.27 | bwd: 1588.11 | bwd_inner: 1.68 | bwd_allreduce: 1586.26 | step: 3.27 50%|█████ | 1758/3507 [42:59<47:08, 1.62s/it] {'loss': 0.5717, 'learning_rate': 1.0447857334078828e-05, 'epoch': 0.5} 50%|█████ | 1758/3507 [42:59<47:08, 1.62s/it]tensor([[-3.7188, 0.3242, 3.5156, -1.5156, -3.9062]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.8242, 0.9297, 2.2188, 1.3750, -0.4922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3125, -4.2188, -1.5312, 2.0156, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -1.4062, 2.6562, -0.1465, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:27:46,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.42 | bwd_microstep: 8.32 | bwd_inner_microstep: 8.18 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.6875, -1.8125, 1.3438, 1.5312, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.7812, -2.5938, 2.0625, 0.0981, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1875, -0.7148, 2.2656, 1.1875, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.2188, -1.3438, 1.8594, -0.2393, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:27:48,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.49 | optimizer_step: 0.42 [2025-11-06 18:27:48,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.36 | bwd_microstep: 1958.40 | bwd_inner_microstep: 1.79 | bwd_allreduce_microstep: 1956.18 | step_microstep: 6.10 [2025-11-06 18:27:48,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.80 | bwd: 1966.75 | bwd_inner: 10.03 | bwd_allreduce: 1956.37 | step: 6.16 50%|█████ | 1759/3507 [43:02<59:45, 2.05s/it] {'loss': 0.9789, 'learning_rate': 1.0438629151291944e-05, 'epoch': 0.5} 50%|█████ | 1759/3507 [43:02<59:45, 2.05s/it]tensor([[2.7188, 5.7188, 6.1250, 1.9219, 1.4609]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:1') tensor([[1.0469, 3.8594, 4.2812, 0.8906, 0.2852]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.0312, -4.3750, -0.7422, 2.3750, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:27:49,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 209.81 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.2969, -3.8438, -1.9453, 2.4531, -0.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -2.7188, 1.8125, 1.4141, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5469, -2.6250, 0.0212, 1.7734, -1.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -2.8594, 1.4219, 1.1797, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5625, -0.1445, 3.5312, -2.2656, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:27:49,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.97 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:27:49,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 284.55 | bwd_microstep: 5.56 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 4.70 | step_microstep: 3.13 [2025-11-06 18:27:49,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 494.39 | bwd: 6.36 | bwd_inner: 1.48 | bwd_allreduce: 4.74 | step: 3.22 50%|█████ | 1760/3507 [43:03<46:35, 1.60s/it] {'loss': 0.4001, 'learning_rate': 1.0429400594235978e-05, 'epoch': 0.5} 50%|█████ | 1760/3507 [43:03<46:35, 1.60s/it]tensor([[-2.9688, -2.4688, 0.5078, 3.4062, -1.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-6.2188, -5.6875, -2.0156, 1.5938, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:27:49,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.74 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.9844, -4.5625, -2.3438, 2.3281, -1.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.0781, -1.5781, 0.8945, 1.1562, -2.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9375, -3.3750, -2.1250, 1.3984, -0.9961]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.1875, -4.0312, -0.6328, 3.3438, -1.7891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2500, -1.8438, 2.1406, -0.9453, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5000, -2.2969, 1.0312, 0.5820, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:27:51,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:27:51,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.66 | bwd_microstep: 1071.59 | bwd_inner_microstep: 2.63 | bwd_allreduce_microstep: 1068.80 | step_microstep: 2.05 [2025-11-06 18:27:51,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.43 | bwd: 1072.39 | bwd_inner: 3.39 | bwd_allreduce: 1068.82 | step: 2.12 50%|█████ | 1761/3507 [43:04<47:29, 1.63s/it] {'loss': 0.5228, 'learning_rate': 1.0420171670785392e-05, 'epoch': 0.5} 50%|█████ | 1761/3507 [43:04<47:29, 1.63s/it]tensor([[-3.1562, -2.0156, 1.0625, 2.8438, -1.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -3.7969, 0.6484, 
2.0625, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7812e+00, -3.8750e+00, -1.6797e-01, 3.0212e-03, -4.1562e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:27:51,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.74 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-7.5000, -5.8750, -0.8281, 0.9844, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -3.2188, -0.3867, 1.8516, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5469, -2.7188, 0.7852, 3.3281, -1.6328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7812, -2.9219, 0.5391, -0.9531, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3438, -1.7500, 2.6562, -0.3613, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:27:51,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.21 | optimizer_step: 0.24 [2025-11-06 18:27:51,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.18 | bwd_microstep: 145.21 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 144.25 | step_microstep: 2.12 [2025-11-06 18:27:51,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.95 | bwd: 146.00 | bwd_inner: 1.54 | bwd_allreduce: 144.29 | step: 2.22 50%|█████ | 1762/3507 [43:05<38:05, 1.31s/it] {'loss': 0.2328, 'learning_rate': 1.0410942388814949e-05, 'epoch': 0.5} 50%|█████ | 1762/3507 [43:05<38:05, 1.31s/it]tensor([[-3.9688, -2.2500, 1.1172, 1.5078, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7031, -1.8672, 1.3438, 1.4453, -2.5000]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:27:52,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.92 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.2500, -1.5078, 2.5156, -1.1016, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6875, -4.2500, -0.9297, 2.3594, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0781, 1.7188, 2.8750, -2.1406, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -3.5781, -0.1157, 1.3594, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -4.7812, -2.4688, 2.0469, -1.7734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1094, 0.1680, 1.9609, -1.8359, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:27:53,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.16 | optimizer_step: 0.15 [2025-11-06 18:27:53,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 296.07 | bwd_microstep: 609.36 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 608.03 | step_microstep: 2.39 [2025-11-06 18:27:53,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 501.00 | bwd: 610.16 | bwd_inner: 1.98 | bwd_allreduce: 608.06 | step: 2.47 50%|█████ | 1763/3507 [43:07<45:17, 1.56s/it] {'loss': 0.6797, 'learning_rate': 1.0401712756199711e-05, 'epoch': 0.5} 50%|█████ | 1763/3507 [43:07<45:17, 1.56s/it]tensor([[-5.5625, -2.8281, 0.6992, -0.8125, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:27:53,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 53.78 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.9688, -4.5312, -0.7148, -1.2578, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0000, 1.0312, 1.9062, -1.4062, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.5625, -3.2344, 0.1133, 1.3203, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8438, -0.8984, 3.4375, -0.5156, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3125, -0.6797, 2.9375, -0.4043, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.8750, -4.5625, -1.2109, -3.9688, -6.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2812, -4.8438, -2.7656, 1.7422, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') [2025-11-06 18:27:54,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:27:54,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 140.98 | bwd_microstep: 805.89 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 804.86 | step_microstep: 1.64 [2025-11-06 18:27:54,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 194.78 | bwd: 806.82 | bwd_inner: 1.80 | bwd_allreduce: 804.90 | step: 1.73 50%|█████ | 1764/3507 [43:08<40:39, 1.40s/it] {'loss': 0.7564, 'learning_rate': 1.0392482780815052e-05, 'epoch': 0.5} 50%|█████ | 1764/3507 [43:08<40:39, 1.40s/it]tensor([[-3.7500, -1.1641, 1.8984, 0.3730, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6250, -4.5625, -1.7734, 2.1094, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:0') [2025-11-06 18:27:55,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.75 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.5938, -1.4297, 1.8203, -0.6328, -3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.2812, -2.8281, 1.4453, 1.0078, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5000, -3.9531, -0.3516, 0.9297, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0000, -4.1875, -0.7656, 1.9531, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0781, -2.5312, -1.5234, 1.7578, -0.3223]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.1719, 1.4688, 1.7969, -1.0078, -1.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:27:57,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.19 | optimizer_step: 0.21 [2025-11-06 18:27:57,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.26 | bwd_microstep: 1853.12 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 1852.25 | step_microstep: 2.12 [2025-11-06 18:27:57,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.04 | bwd: 1854.06 | bwd_inner: 1.61 | bwd_allreduce: 1852.29 | step: 2.19 50%|█████ | 1765/3507 [43:10<48:23, 1.67s/it] {'loss': 0.9138, 'learning_rate': 1.0383252470536631e-05, 'epoch': 0.5} 50%|█████ | 1765/3507 [43:10<48:23, 1.67s/it]tensor([[-5.6250, -2.5781, 1.6953, -0.1123, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5000, -5.5000, -1.4609, 1.2812, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 
18:27:57,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.84 | bwd_microstep: 1.34 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 tensor([[-2.1875, -2.2344, -0.7422, 2.2031, -0.5547]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7500, -2.5625, 0.8398, 2.6562, -1.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6875, -0.5469, 2.7031, 0.1514, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -2.5625, 1.3750, 2.5156, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.8750, -6.6250, -1.5156, 1.6484, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.8203, 1.7266, 3.1094, -1.2812, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:27:58,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.24 | optimizer_step: 0.35 [2025-11-06 18:27:58,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 317.01 | bwd_microstep: 1191.85 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 1190.56 | step_microstep: 2.57 [2025-11-06 18:27:58,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 508.91 | bwd: 1193.18 | bwd_inner: 2.29 | bwd_allreduce: 1190.65 | step: 2.70 50%|█████ | 1766/3507 [43:12<49:06, 1.69s/it] {'loss': 0.8152, 'learning_rate': 1.0374021833240391e-05, 'epoch': 0.5} 50%|█████ | 1766/3507 [43:12<49:06, 1.69s/it]tensor([[-5.3750, -4.9062, -1.6484, 1.5859, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -3.0625, 0.5195, 1.5859, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1875, -3.8125, -0.5938, 2.8594, 
-1.9609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:27:59,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.90 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-6.6562, -3.7812, 1.4062, 0.5781, -4.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9453, -0.3086, 2.4062, 3.1562, -0.8164]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2188, -3.2969, 0.4434, 0.6016, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2500, -2.5469, 1.5312, 0.3711, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8125, -0.8906, 1.6875, 1.6953, -1.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:28:01,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:28:01,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.37 | bwd_microstep: 1978.52 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 1977.56 | step_microstep: 1.83 [2025-11-06 18:28:01,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.29 | bwd: 1979.51 | bwd_inner: 1.74 | bwd_allreduce: 1977.62 | step: 1.93 50%|█████ | 1767/3507 [43:15<55:20, 1.91s/it] {'loss': 0.7219, 'learning_rate': 1.0364790876802564e-05, 'epoch': 0.5} 50%|█████ | 1767/3507 [43:15<55:20, 1.91s/it]tensor([[-4.4688, -4.5938, -2.1094, 1.9922, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:28:01,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.65 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 
tensor([[-3.9531, -2.5938, 0.9727, 2.7500, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1250, -3.9062, -0.2793, 1.6250, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.5000, -4.4375, 0.7500, 1.8203, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.9375, -3.4844, 0.0854, 1.2812, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.4844, -1.5938, 1.7109, 4.0000, -0.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0312, -4.6250, -1.1484, 2.2969, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.1562, -4.1250, 0.7266, 1.7969, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:28:01,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:28:01,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.24 | bwd_microstep: 202.84 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 201.64 | step_microstep: 1.83
[2025-11-06 18:28:01,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.91 | bwd: 203.81 | bwd_inner: 1.99 | bwd_allreduce: 201.68 | step: 1.91
 50%|█████ | 1768/3507 [43:15<43:56, 1.52s/it] {'loss': 0.6621, 'learning_rate': 1.0355559609099641e-05, 'epoch': 0.5}
tensor([[-4.1250, -5.0312, -3.1094, 2.1406, -1.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:28:02,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 148.13 | bwd_microstep: 1.52 | bwd_inner_microstep: 1.37 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.9375, 0.1875, 2.2188, -0.8242, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5000, -1.3594, 1.8672, -0.8320, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.9688, -2.3125, 0.5977, 3.1562, -1.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.0000, -3.1250, 1.7109, 0.6406, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[ 0.3398, 3.2344, 2.7969, -1.3984, -0.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-4.6562, -1.8672, 2.0156, 0.6172, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.1250, -2.8594, -1.7109, 2.3906, -0.1475]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
[2025-11-06 18:28:03,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:28:03,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.54 | bwd_microstep: 1663.29 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1662.20 | step_microstep: 1.73
[2025-11-06 18:28:03,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.70 | bwd: 1664.79 | bwd_inner: 2.36 | bwd_allreduce: 1662.25 | step: 1.82
 50%|█████ | 1769/3507 [43:17<48:25, 1.67s/it] {'loss': 1.2751, 'learning_rate': 1.034632803800839e-05, 'epoch': 0.5}
tensor([[-4.8750, -3.9844, -0.4746, 2.2812, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2031, -0.3477, 1.9453, -0.5859, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8750, -3.3750, 0.4551, 2.1406, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:04,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.22 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.2500, -1.5703, 2.6562, 1.6484, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[ 1.2188, 3.8281, 3.2500, -0.0806, 0.2949]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:2')
tensor([[-6.5938, -5.2188, -0.2539, 2.2812, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5000, -3.8594, -1.7109, 2.5781, -1.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5156, 0.8516, 4.0312, -1.5703, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:28:04,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:28:04,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.22 | bwd_microstep: 1.88 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.82 | step_microstep: 1.90
[2025-11-06 18:28:04,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 453.47 | bwd: 2.93 | bwd_inner: 1.94 | bwd_allreduce: 0.86 | step: 1.99
 50%|█████ | 1770/3507 [43:18<38:11, 1.32s/it] {'loss': 0.5051, 'learning_rate': 1.0337096171405832e-05, 'epoch': 0.5}
tensor([[-2.7031, -0.6016, 1.8984, 0.9883, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1875, -4.5938, -2.2812, 1.8828, -1.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0000, -0.4062, 2.7812, -1.0703, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:28:04,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.76 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.5000, -3.1250, 1.2266, -1.1016, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.8516, -1.2891, -0.2490, 3.0156, 0.6641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9375, -3.6719, 0.2617, 2.4375, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1875, -3.7031, 0.5781, 2.4219, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5938, -3.8906, -1.5859, 2.3281, -1.3672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:28:07,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.22 | optimizer_step: 0.26
[2025-11-06 18:28:07,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.04 | bwd_microstep: 2795.34 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 2794.18 | step_microstep: 2.53
[2025-11-06 18:28:07,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.83 | bwd: 2796.01 | bwd_inner: 1.62 | bwd_allreduce: 2794.23 | step: 2.61
 50%|█████ | 1771/3507 [43:21<54:37, 1.89s/it] {'loss': 0.1375, 'learning_rate': 1.032786401716924e-05, 'epoch': 0.5}
tensor([[-5.7812, -2.8594, 1.7344, 0.4375, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4375, -3.8906, 0.7500, 2.6094, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1562, -3.4688, -0.5078, 1.9219, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5000, -3.8281, -1.4531, 2.7969, -1.1484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:07,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.23 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.5938, -5.6875, -1.6953, 1.3516, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5312, -1.6562, 1.4453, -0.5234, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.5000, -3.9219, 0.1011, 1.4609, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.7500, -5.4375, -0.5625, 2.0312, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:08,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.14 | optimizer_step: 0.18
[2025-11-06 18:28:08,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.26 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.11
[2025-11-06 18:28:08,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 423.51 | bwd: 3.12 | bwd_inner: 2.13 | bwd_allreduce: 0.87 | step: 2.19
 51%|█████ | 1772/3507 [43:21<42:28, 1.47s/it] {'loss': 0.1353, 'learning_rate': 1.0318631583176136e-05, 'epoch': 0.51}
tensor([[-4.8438, -4.5625, -1.2734, 2.4844, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0625, -2.8750, 0.3984, 2.0312, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:08,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.60 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[0.6250, 3.3438, 4.3125, 1.3516, 0.1021]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.0938, -4.1562, -0.3770, 0.1816, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0312, -3.9219, -0.1611, 2.0312, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.5469, -1.3125, 2.1406, 3.8125, -1.0547]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5469, 0.3535, 3.1094, -1.3203, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0312, -2.2188, 0.7461, 0.9766, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:28:09,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:28:09,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.02 | bwd_microstep: 762.00 | bwd_inner_microstep: 1.55 | bwd_allreduce_microstep: 760.37 | step_microstep: 1.73
[2025-11-06 18:28:09,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 370.64 | bwd: 762.79 | bwd_inner: 2.27 | bwd_allreduce: 760.40 | step: 1.80
 51%|█████ | 1773/3507 [43:23<39:50, 1.38s/it] {'loss': 0.3308, 'learning_rate': 1.0309398877304278e-05, 'epoch': 0.51}
tensor([[-2.6094, 1.0781, 3.3906, -0.7383, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.2891, -1.3672, 0.7695, 4.2500, 0.4570]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.3438, -0.2520, 2.4531, -0.3125, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:28:09,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.35 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.6719, -4.0000, -1.8750, 2.0938, -1.4453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5938, -2.9531, -0.0698, 2.3906, -1.7422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.8438, -2.1562, 1.2656, 2.1094, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.5625, -4.4062, 0.5742, 1.3828, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.4375, -0.5625, 3.2656, -1.0078, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:28:09,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:28:09,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.62 | bwd_microstep: 35.31 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 34.27 | step_microstep: 1.46
[2025-11-06 18:28:09,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 410.00 | bwd: 36.21 | bwd_inner: 1.79 | bwd_allreduce: 34.30 | step: 1.54
 51%|█████ | 1774/3507 [43:23<32:04, 1.11s/it] {'loss': 0.4626, 'learning_rate': 1.0300165907431652e-05, 'epoch': 0.51}
tensor([[-2.5156, 0.3145, 3.1094, 0.8828, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6250, -3.4688, -0.4355, 3.3594, -1.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:09,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.51 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.7031, -3.0469, -0.1455, 2.1250, -1.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.1719, 1.7578, 3.0469, -2.2500, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.1562, -2.9375, -0.8281, 2.0469, -1.3047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1250, -2.5625, 0.9062, 1.6250, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.0938, -0.4922, 3.2188, -0.3242, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.8125, -0.7891, 1.1797, 0.4219, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:28:11,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.28 | optimizer_step: 0.29
[2025-11-06 18:28:11,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.76 | bwd_microstep: 1735.85 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1734.73 | step_microstep: 2.90
[2025-11-06 18:28:11,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.31 | bwd: 1736.74 | bwd_inner: 1.78 | bwd_allreduce: 1734.79 | step: 2.99
 51%|█████ | 1775/3507 [43:25<41:00, 1.42s/it] {'loss': 0.2061, 'learning_rate': 1.0290932681436482e-05, 'epoch': 0.51}
tensor([[-4.2812, -4.3750, -1.6094, 2.2812, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5312, -5.3125, -2.0156, 1.9922, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9688, -2.4531, 1.0156, -0.0640, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.7031, 0.9258, 3.5000, -0.9688, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7500, -1.3984, 2.1094, 1.7891, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6250, -2.8125, 1.7109, 0.4355, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3750, -1.5078, 1.5234, -0.4863, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:28:13,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.56 | bwd_microstep: 1.17 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13
tensor([[-3.7188, -1.8359, 1.0859, 0.9609, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:14,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:28:14,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.73 | bwd_microstep: 1.95 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.13
[2025-11-06 18:28:14,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 442.24 | bwd: 3.11 | bwd_inner: 1.96 | bwd_allreduce: 0.95 | step: 2.26
 51%|█████ | 1776/3507 [43:28<48:47, 1.69s/it] {'loss': 0.3251, 'learning_rate': 1.0281699207197196e-05, 'epoch': 0.51}
tensor([[-5.5938, -5.6875, -2.7500, 1.5391, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:14,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.56 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-0.0640, -0.4629, 0.8438, 4.5312, 1.4922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5312, -4.0625, -1.2031, 1.6328, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.1250, -3.5625, 0.5156, 0.1270, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4844, -3.2500, -0.6836, 2.4219, -1.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.8516, 1.8047, 1.9609, -1.3438, -1.3828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.4688, -3.2031, -0.2500, 1.0547, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.1250, -4.1875, -0.3301, 2.2344, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:28:14,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:28:14,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.43 | bwd_microstep: 126.58 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 125.56 | step_microstep: 1.49
[2025-11-06 18:28:14,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.02 | bwd: 127.54 | bwd_inner: 1.79 | bwd_allreduce: 125.61 | step: 1.59
 51%|█████ | 1777/3507 [43:28<38:06, 1.32s/it] {'loss': 0.2286, 'learning_rate': 1.027246549259244e-05, 'epoch': 0.51}
tensor([[-5.1250, -2.5781, 1.0859, 0.3730, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1562, -3.0312, 0.7148, 3.2188, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.1875, -5.9375, -0.8867, 2.0469, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5156, -0.0601, 1.9766, -1.9766, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.1250, -4.8750, -0.3164, 2.1250, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0312, -4.3125, 0.7578, 2.5625, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.2188, -4.7188, -0.8281, 0.3652, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:28:16,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.85 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.7031, -3.2031, -0.7539, 2.0312, -1.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:16,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:28:16,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.47 | bwd_microstep: 6.69 | bwd_inner_microstep: 5.74 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.37
[2025-11-06 18:28:16,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.33 | bwd: 7.65 | bwd_inner: 6.61 | bwd_allreduce: 0.91 | step: 2.45
 51%|█████ | 1778/3507 [43:30<40:11, 1.39s/it] {'loss': 0.6799, 'learning_rate': 1.0263231545501068e-05, 'epoch': 0.51}
tensor([[-3.2344, -1.7266, 0.8438, 1.3906, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0000, -4.1875, -1.4844, 2.7656, -1.5547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.1562, -0.4082, 2.6562, -1.2266, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:28:16,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.17 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-3.3438, -1.2891, 1.2422, 0.7109, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.5938, -5.3125, -1.6328, 0.3008, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.8125, -4.6250, -1.4297, 2.3594, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.5781, 0.0396, 2.6250, 0.9336, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2812, -2.7344, 0.9414, 1.9766, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:28:16,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 6.45 | optimizer_step: 0.20
[2025-11-06 18:28:16,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.42 | bwd_microstep: 2.14 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.93 | step_microstep: 8.03
[2025-11-06 18:28:16,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.61 | bwd: 2.92 | bwd_inner: 1.77 | bwd_allreduce: 0.97 | step: 8.13
 51%|█████ | 1779/3507 [43:30<31:44, 1.10s/it] {'loss': 0.4321, 'learning_rate': 1.0253997373802132e-05, 'epoch': 0.51}
tensor([[-2.5938, 1.5781, 4.1562, -1.5000, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.8125, -5.1562, 0.5078, 0.4746, -5.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.9062, -5.9688, -1.3047, 2.1094, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.8438, -3.6719, 0.8516, 1.3906, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7188, -2.9062, 1.7031, 0.7734, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.9375, -4.7812, -0.0762, 2.7656, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.9062, -4.0625, 0.8164, 2.0938, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:19,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.75 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.6562, -3.9375, 1.4062, 1.0703, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:28:19,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:28:19,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.32 | bwd_microstep: 1.92 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.47
[2025-11-06 18:28:19,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.08 | bwd: 2.85 | bwd_inner: 1.83 | bwd_allreduce: 0.87 | step: 2.56
 51%|█████ | 1780/3507 [43:33<49:16, 1.71s/it] {'loss': 0.3106, 'learning_rate': 1.0244762985374863e-05, 'epoch': 0.51}
tensor([[-5.5938, -3.2344, 1.3438, 1.4531, -3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:28:19,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.17 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.8125, -4.5938, -1.1016, 0.7656, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4688, -2.4688, 0.0630, -0.3906, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4219, 0.5039, 2.4375, -2.5156, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9688, -3.4219, 1.1328, 2.8438, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.3906, -1.7578, 1.8438, 2.8438, -1.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.1250, -6.0312, -1.3438, 1.7188, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.1562, -2.6719, 1.6406, 1.1016, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:28:20,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:28:20,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.88 | bwd_microstep: 81.18 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 80.36 | step_microstep: 1.46
[2025-11-06 18:28:20,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.07 | bwd: 82.12 | bwd_inner: 1.62 | bwd_allreduce: 80.39 | step: 1.53
 51%|█████ | 1781/3507 [43:34<38:25, 1.34s/it] {'loss': 0.3871, 'learning_rate': 1.0235528388098701e-05, 'epoch': 0.51}
tensor([[-2.8281, -2.1406, 0.7422, 3.3125, -1.0547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7812, -5.6250, -1.6875, 2.8906, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0000, -0.1973, 2.3906, -2.1719, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4062, -3.6875, -0.5391, 1.8906, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.6250, -4.7500, -0.9258, 1.9766, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.8125, -4.4688, -0.5625, 1.3203, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0000, -4.0938, -1.3281, 3.0156, -1.5234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:21,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.35 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.3125, -1.2734, 3.1094, -1.1250, -4.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:28:21,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:28:21,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.26 | bwd_microstep: 1.87 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.81 | step_microstep: 1.97
[2025-11-06 18:28:21,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.62 | bwd: 2.75 | bwd_inner: 1.76 | bwd_allreduce: 0.84 | step: 2.05
 51%|█████ | 1782/3507 [43:35<36:29, 1.27s/it] {'loss': 0.373, 'learning_rate': 1.0226293589853238e-05, 'epoch': 0.51}
tensor([[-3.5938, 0.2969, 2.5781, -2.3750, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.5938, -3.7031, -1.4219, 2.3438, -1.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:21,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.03 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.8750, -4.5000, -0.5352, 1.2031, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4219, -3.5000, -0.4688, 3.8125, -1.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.4844, -3.3281, -2.6094, 1.3984, -0.4414]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-4.0625, -4.1875, -1.1016, 3.2656, -1.5547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6562, -4.2500, -1.0781, 2.4062, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.1250, -2.2031, 1.3828, 3.8906, -1.2734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:28:22,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:28:22,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.90 | bwd_microstep: 97.74 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 96.80 | step_microstep: 2.07
[2025-11-06 18:28:22,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 298.94 | bwd: 98.56 | bwd_inner: 1.59 | bwd_allreduce: 96.84 | step: 2.14
 51%|█████ | 1783/3507 [43:36<37:25, 1.30s/it] {'loss': 0.8188, 'learning_rate': 1.0217058598518259e-05, 'epoch': 0.51}
tensor([[-5.4688, -3.5312, 0.3652, 0.5938, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.9219, 0.1152, 1.0781, -1.8906, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-2.9375, -1.7656, 0.8555, 2.2656, -1.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6719, -0.9375, 2.0156, -0.0549, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5938, -2.8594, 0.8359, 1.7734, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9688, -2.1562, 1.7188, 0.3770, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.3750, -5.6562, -2.7812, 1.7344, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:25,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.15 | bwd_microstep: 1.14 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-5.2500, -4.3750, -0.8086, 1.7500, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:25,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.83 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:28:25,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.13 | bwd_microstep: 1.79 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.86 | step_microstep: 5.45
[2025-11-06 18:28:25,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 407.32 | bwd: 2.93 | bwd_inner: 1.85 | bwd_allreduce: 0.91 | step: 5.56
 51%|█████ | 1784/3507 [43:39<51:01, 1.78s/it] {'loss': 0.4146, 'learning_rate': 1.02078234219737e-05, 'epoch': 0.51}
tensor([[-3.9844, -4.4375, -2.0781, 2.6406, -1.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6406, -4.3438, -2.7500, 1.8828, -1.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:25,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.72 | bwd_microstep: 1.46 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.16
tensor([[-5.0000, -2.7812, 0.9883, 0.8125, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4375, -3.0156, -0.2988, 2.9531, -1.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.7656, -3.2656, -1.4062, 3.2188, -0.4297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.9297, 1.2031, 2.1562, -1.6484, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.2500, -3.0469, 0.1670, 1.9609, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.8750, -3.1406, 1.4609, 0.7656, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:28:26,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.10 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:28:26,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.14 | bwd_microstep: 177.76 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 176.70 | step_microstep: 2.51
[2025-11-06 18:28:26,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.90 | bwd: 179.21 | bwd_inner: 2.19 | bwd_allreduce: 176.79 | step: 2.67
 51%|█████ | 1785/3507 [43:40<40:19, 1.41s/it] {'loss': 0.4664, 'learning_rate': 1.0198588068099658e-05, 'epoch': 0.51}
tensor([[-6.0000, -3.9688, 0.3008, 0.7852, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2188, -3.0000, 0.0231, 1.7109, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4375, -2.8438, 0.6367, -0.9062, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.9180, 2.3594, 3.6562, -0.6641, -1.6016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.4219, 1.1797, 3.0625, -1.7031, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.6562, -4.7812, 0.6055, 2.2500, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2500, -3.7344, -0.2734, 2.9688, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:28:28,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.40 | bwd_microstep: 1.67 | bwd_inner_microstep: 1.50 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.13
tensor([[-4.4688, -2.4219, 1.1094, 1.2656, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:28:28,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.24 | optimizer_step: 0.31
[2025-11-06 18:28:28,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.25 | bwd_microstep: 2.75 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 1.35 | step_microstep: 2.55
[2025-11-06 18:28:28,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.64 | bwd: 4.42 | bwd_inner: 2.80 | bwd_allreduce: 1.41 | step: 2.69
 51%|█████ | 1786/3507 [43:42<48:32, 1.69s/it] {'loss': 0.2935, 'learning_rate': 1.0189352544776387e-05, 'epoch': 0.51}
tensor([[-4.3750, -1.4688, 1.9609, 0.1357, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.9688, 0.9336, 3.2500, -1.9688, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.9375, -2.2500, 1.7500, 0.5938, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:28:28,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.78 | bwd_microstep: 2.31 | bwd_inner_microstep: 1.86 | bwd_allreduce_microstep: 0.16 | step_microstep: 0.26
tensor([[-3.5625, -1.9766, 0.6797, 1.1719, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1562, -3.3750, -0.1187, 2.3281, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9062, -4.2500, -1.6875, 2.8281, -1.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.8750, 1.6094, 3.0000, -1.5078, -2.4531]],
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-5.0312, -2.8906, 1.5469, 1.5234, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:28:29,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.27 | optimizer_step: 0.17 [2025-11-06 18:28:29,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.02 | bwd_microstep: 156.70 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 155.68 | step_microstep: 2.39 [2025-11-06 18:28:29,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.89 | bwd: 159.03 | bwd_inner: 2.82 | bwd_allreduce: 155.87 | step: 2.64 51%|█████ | 1787/3507 [43:42<38:53, 1.36s/it] {'loss': 0.753, 'learning_rate': 1.018011685988428e-05, 'epoch': 0.51} 51%|█████ | 1787/3507 [43:42<38:53, 1.36s/it]tensor([[-5.3125, -4.4062, -1.0312, 1.0781, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8906, 1.4453, 4.0000, -1.9531, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7188, -2.7500, 1.7969, 0.1338, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6562, -5.7812, -2.3906, 2.3594, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -3.8438, -0.0728, 1.9922, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.7891, 1.6094, 2.7969, -2.0781, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.6250, -3.5625, 0.6914, -0.8750, -5.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:28:30,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.49 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.96 | 
bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.4375, -4.6875, 0.0564, 1.3828, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:28:31,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.22 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:28:31,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.21 | bwd_microstep: 2.23 | bwd_inner_microstep: 1.28 | bwd_allreduce_microstep: 0.88 | step_microstep: 3.52 [2025-11-06 18:28:31,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.72 | bwd: 3.31 | bwd_inner: 2.26 | bwd_allreduce: 0.91 | step: 3.60 51%|█████ | 1788/3507 [43:44<44:17, 1.55s/it] {'loss': 0.3545, 'learning_rate': 1.0170881021303867e-05, 'epoch': 0.51} 51%|█████ | 1788/3507 [43:44<44:17, 1.55s/it]tensor([[-2.5781, 1.1094, 2.6562, -2.3281, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-2.7031, 0.5664, 2.3594, -1.2891, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2188, 1.4688, 2.6406, -2.2500, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:28:31,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.21 | bwd_microstep: 1.37 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-8.0625, -7.0000, -2.8750, -0.4238, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -3.9844, -0.2471, 1.9141, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0312, -5.0312, -0.4805, 2.6094, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.4277, 2.2031, 1.7656, -1.8438, -1.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], 
device='cuda:3') tensor([[-7.2812, -6.2812, -1.2031, 2.3125, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:28:33,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:28:33,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.03 | bwd_microstep: 1482.94 | bwd_inner_microstep: 1.52 | bwd_allreduce_microstep: 1481.32 | step_microstep: 1.74 [2025-11-06 18:28:33,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 411.27 | bwd: 1484.30 | bwd_inner: 2.80 | bwd_allreduce: 1481.36 | step: 1.83 51%|█████ | 1789/3507 [43:46<47:36, 1.66s/it] {'loss': 0.3849, 'learning_rate': 1.0161645036915818e-05, 'epoch': 0.51} 51%|█████ | 1789/3507 [43:46<47:36, 1.66s/it]tensor([[-6.1875, -4.2500, 0.3457, 1.1562, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -2.8438, 1.3203, 1.1719, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:28:33,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.19 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.3125e+00, -5.7422e-01, 1.8828e+00, -8.7738e-04, -2.7812e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5938, -2.5469, 0.0815, -0.1035, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.8125, -4.8125, 0.2832, 1.6797, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.9805, 2.1094, 3.2812, -0.6836, -1.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.0000, -2.3594, 0.8047, 1.3984, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-4.5000, -3.6562, 0.2969, 2.8906, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:28:33,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:28:33,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.36 | bwd_microstep: 107.67 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 106.64 | step_microstep: 2.24 [2025-11-06 18:28:33,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.58 | bwd: 108.38 | bwd_inner: 1.54 | bwd_allreduce: 106.69 | step: 2.32 51%|█████ | 1790/3507 [43:47<37:07, 1.30s/it] {'loss': 0.5602, 'learning_rate': 1.0152408914600911e-05, 'epoch': 0.51} 51%|█████ | 1790/3507 [43:47<37:07, 1.30s/it]tensor([[-4.5938, -2.2812, 1.8594, 1.5938, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5000, -2.4219, 1.9766, 0.3105, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -3.8281, -0.2812, 2.0469, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0000, -4.4375, -0.5273, 0.9805, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:28:33,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.89 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.2812, -3.7812, 0.4727, 2.3125, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8359, 1.1953, 3.4688, 0.5039, -1.8672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9375, -1.1016, 2.4375, -1.6562, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.7188, -3.6562, 0.0991, 
-0.1191, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:28:33,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:28:33,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.95 | bwd_microstep: 12.96 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 11.98 | step_microstep: 1.43 [2025-11-06 18:28:33,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.85 | bwd: 13.89 | bwd_inner: 1.73 | bwd_allreduce: 12.02 | step: 1.52 51%|█████ | 1791/3507 [43:47<29:13, 1.02s/it] {'loss': 0.2889, 'learning_rate': 1.0143172662240062e-05, 'epoch': 0.51} 51%|█████ | 1791/3507 [43:47<29:13, 1.02s/it]tensor([[-4.6562, -2.6406, 0.9102, 1.1484, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4375, -2.5156, 0.3008, 2.1094, -1.8359]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -4.0625, -1.3516, 2.1250, -1.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3438, -3.5781, -0.1816, 2.4531, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5000, -4.7188, -0.7422, 2.3438, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:28:34,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.98 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.7188, -0.3516, 2.5625, -0.7734, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3438, 1.9609, 4.1875, -2.1562, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.7656, -3.7031, -0.9727, 2.8906, -1.4922]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:28:35,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.75 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:28:35,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 243.99 | bwd_microstep: 167.26 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 166.14 | step_microstep: 2.31 [2025-11-06 18:28:35,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 436.97 | bwd: 168.22 | bwd_inner: 1.88 | bwd_allreduce: 166.19 | step: 2.41 51%|█████ | 1792/3507 [43:49<34:32, 1.21s/it] {'loss': 0.1493, 'learning_rate': 1.0133936287714281e-05, 'epoch': 0.51} 51%|█████ | 1792/3507 [43:49<34:32, 1.21s/it]tensor([[-5.0312, -2.4531, 1.2656, -0.2969, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:28:35,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.10 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.1250, -0.8164, 0.1660, -3.7344, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-2.2500, -0.0933, 2.8125, 2.3906, -1.4297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.6562, -6.1562, -0.7227, 1.8750, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2812, -1.3750, 2.8438, -1.1328, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2188, -3.0156, 0.4258, 2.2188, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1875, -4.5625, -1.6172, 1.3750, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.1562, -5.2500, -0.8633, 2.4375, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') [2025-11-06 18:28:37,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:28:37,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.42 | bwd_microstep: 1640.88 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 1639.74 | step_microstep: 2.07 [2025-11-06 18:28:37,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.55 | bwd: 1641.74 | bwd_inner: 1.83 | bwd_allreduce: 1639.78 | step: 2.14 51%|█████ | 1793/3507 [43:51<41:28, 1.45s/it] {'loss': 0.3104, 'learning_rate': 1.012469979890469e-05, 'epoch': 0.51} 51%|█████ | 1793/3507 [43:51<41:28, 1.45s/it]tensor([[-4.7500, -2.0000, 1.9453, 0.6289, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:28:37,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.76 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06 tensor([[-1.9297, 0.3164, 3.1094, 2.0781, -1.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8438, -3.8750, -0.2090, 2.2031, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9062, -2.2344, 1.5938, -1.8750, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.6562, -3.9844, 1.1719, 0.6836, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4375, -1.4766, 2.8281, -1.3359, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3125, -2.9688, 0.9922, 0.7188, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5625, -4.6562, -0.1992, 0.7578, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') 
[2025-11-06 18:28:38,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.17 | optimizer_step: 0.21 [2025-11-06 18:28:38,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.86 | bwd_microstep: 177.47 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 176.39 | step_microstep: 1.89 [2025-11-06 18:28:38,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.65 | bwd: 178.35 | bwd_inner: 1.79 | bwd_allreduce: 176.42 | step: 1.96 51%|█████ | 1794/3507 [43:51<33:37, 1.18s/it] {'loss': 0.823, 'learning_rate': 1.0115463203692507e-05, 'epoch': 0.51} 51%|█████ | 1794/3507 [43:51<33:37, 1.18s/it]tensor([[-4.0625, -2.5625, 0.5742, 1.3281, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4688, -2.9531, 0.1973, -0.9453, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:28:38,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.45 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 tensor([[-8.6250, -5.4062, 0.3789, -0.3242, -6.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0938, 0.2891, 3.9844, -1.6328, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5000, -3.1094, -0.2559, 0.9414, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8281, -1.1016, 1.3047, 1.1406, -1.9453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -4.1250, -0.7344, 2.2812, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-7.7500, -5.0000, 0.6211, 0.2559, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:28:39,688] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.27 [2025-11-06 18:28:39,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.15 | bwd_microstep: 963.92 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 962.67 | step_microstep: 2.09 [2025-11-06 18:28:39,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 292.63 | bwd: 965.01 | bwd_inner: 2.09 | bwd_allreduce: 962.74 | step: 2.19 51%|█████ | 1795/3507 [43:53<37:18, 1.31s/it] {'loss': 1.2069, 'learning_rate': 1.0106226509959045e-05, 'epoch': 0.51} 51%|█████ | 1795/3507 [43:53<37:18, 1.31s/it]tensor([[-4.4688, -2.1719, 2.3906, 2.3594, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, -4.1250, -1.0703, 2.5625, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:28:39,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.08 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.3750, -0.1118, 2.2812, -0.8867, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5625, -3.0156, -0.3574, 2.1562, -1.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7188, -3.9688, -0.3262, 2.6719, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-3.5781, -0.6875, 2.5938, 0.7383, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([2], device='cuda:2') tensor([[-4.7500, -2.0156, 0.7070, -1.1641, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5625, -1.0938, 2.7969, -0.8125, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:28:40,980] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:28:40,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 218.38 | bwd_microstep: 2.38 | bwd_inner_microstep: 1.52 | bwd_allreduce_microstep: 0.79 | step_microstep: 1.93 [2025-11-06 18:28:40,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 412.40 | bwd: 3.39 | bwd_inner: 2.44 | bwd_allreduce: 0.82 | step: 2.01 51%|█████ | 1796/3507 [43:54<37:08, 1.30s/it] {'loss': 0.1861, 'learning_rate': 1.009698972558569e-05, 'epoch': 0.51} 51%|█████ | 1796/3507 [43:54<37:08, 1.30s/it]tensor([[-5.1562, -4.6250, -1.1172, 2.1094, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:28:41,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.57 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.2500, -4.2500, -1.4609, 2.3125, -1.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8750, 0.3262, 3.6562, -2.0000, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4844, 0.0417, 3.5312, -0.2207, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.7500, -2.6875, 1.4453, 2.0156, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.0469, 1.6016, 2.4688, -2.8125, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9062, -4.5312, -1.2891, 2.2031, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.2656, 1.9375, 2.5156, -1.8594, -1.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:28:42,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | 
optimizer_gradients: 0.19 | optimizer_step: 0.23 [2025-11-06 18:28:42,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.82 | bwd_microstep: 1412.67 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 1411.50 | step_microstep: 2.07 [2025-11-06 18:28:42,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.42 | bwd: 1413.44 | bwd_inner: 1.78 | bwd_allreduce: 1411.53 | step: 2.14 51%|█████ | 1797/3507 [43:56<41:31, 1.46s/it] {'loss': 0.1901, 'learning_rate': 1.0087752858453923e-05, 'epoch': 0.51} 51%|█████ | 1797/3507 [43:56<41:31, 1.46s/it]tensor([[-3.0938, 0.3633, 2.6719, -1.1562, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:28:42,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.92 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10 tensor([[-5.9375, -2.4844, 2.4688, -0.0069, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8281, -3.4844, -0.4863, 2.9219, -1.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -3.9531, -0.3203, 2.5781, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.1250, 1.8281, 3.2656, -2.4375, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.3438, -3.1094, 0.9570, 3.0000, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, -1.2500, 1.6797, -0.6406, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.2812, -3.5312, 1.5312, 0.7656, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:28:43,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.18 | 
optimizer_step: 0.18 [2025-11-06 18:28:43,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.18 | bwd_microstep: 65.62 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 64.57 | step_microstep: 1.73 [2025-11-06 18:28:43,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.11 | bwd: 66.64 | bwd_inner: 1.89 | bwd_allreduce: 64.61 | step: 1.83 51%|█████▏ | 1798/3507 [43:57<32:49, 1.15s/it] {'loss': 0.4054, 'learning_rate': 1.0078515916445276e-05, 'epoch': 0.51} 51%|█████▏ | 1798/3507 [43:57<32:49, 1.15s/it]tensor([[-6.3125, -4.8125, -0.1260, 1.9062, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4375, -1.6172, 1.5703, -0.4746, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8750, -4.0625, -0.4727, 0.1309, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:28:43,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.03 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.1562, -4.1250, -0.8047, 3.4219, -1.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -1.2500, -0.4941, -3.3125, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.8438, -3.2656, 0.8242, 2.3594, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6406, 0.1533, 2.3125, -2.3281, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0312, -4.0000, -0.1875, 2.3906, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:28:44,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 
18:28:44,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.56 | bwd_microstep: 736.28 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 735.23 | step_microstep: 2.12 [2025-11-06 18:28:44,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.62 | bwd: 737.26 | bwd_inner: 1.86 | bwd_allreduce: 735.27 | step: 2.21 51%|█████▏ | 1799/3507 [43:58<32:41, 1.15s/it] {'loss': 0.3749, 'learning_rate': 1.0069278907441355e-05, 'epoch': 0.51} 51%|█████▏ | 1799/3507 [43:58<32:41, 1.15s/it]tensor([[-2.6719, 0.3027, 2.3750, -0.8633, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9062, -2.4375, 2.1562, -0.5742, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.8438, -4.5000, 0.7305, 1.1250, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2812, -1.6797, 0.9453, -0.8711, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0625, -3.1719, 2.4531, 1.7734, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:28:45,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.30 | bwd_microstep: 4.48 | bwd_inner_microstep: 4.34 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.3750, -3.9531, -0.9141, 2.2812, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4844, -0.0300, 1.6641, -0.4023, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1562, 0.7188, 3.5000, -1.3281, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:28:46,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:28:46,468] [INFO] 
[2025-11-06 18:28:46,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.69 | bwd: 6.67 | bwd_inner: 5.68 | bwd_allreduce: 0.84 | step: 2.83
 51%|█████▏    | 1800/3507 [44:00<40:42, 1.43s/it] {'loss': 1.2563, 'learning_rate': 1.0060041839323827e-05, 'epoch': 0.51}

[Raw console output condensed. Between steps, the log interleaves: (a) per-rank debug prints of a 1x5 bfloat16 logit tensor and an integer label tensor on cuda:0-3 (the grad_fn=<...> values were stripped to "grad_fn=)" during text extraction); (b) DeepSpeed [Rank 0] timing lines reporting fwd_microstep, bwd_microstep, bwd_inner_microstep, bwd_allreduce_microstep, step_microstep, optimizer_allgather, optimizer_gradients, and optimizer_step in ms, with bwd_allreduce occasionally spiking to 500-1600 ms; (c) occasional video-decoder warnings of the form "[h264 @ 0x...] mmco: unref short failure". Per-step training metrics for steps 1800-1821 of 3507:]

step  loss    learning_rate           epoch  s/it
1800  1.2563  1.0060041839323827e-05  0.51   1.43
1801  0.1829  1.0050804719974402e-05  0.51   1.33
1802  0.2088  1.004156755727483e-05   0.51   1.57
1803  0.2763  1.0032330359106919e-05  0.51   1.25
1804  0.5774  1.0023093133352478e-05  0.51   1.64
1805  0.5006  1.0013855887893362e-05  0.51   1.28
1806  0.2446  1.0004618630611435e-05  0.51   1.57
1807  0.5574  9.99538136938857e-06    0.52   1.27
1808  0.4811  9.98614411210664e-06    0.52   1.58
1809  0.3172  9.976906866647526e-06   0.52   1.22
1810  0.7546  9.967669640893085e-06   0.52   1.71
1811  0.8493  9.95843244272517e-06    0.52   1.36
1812  0.5657  9.9491952800256e-06     0.52   1.55
1813  0.1002  9.93995816067618e-06    0.52   1.25
1814  0.3113  9.930721092558648e-06   0.52   1.44
1815  0.5703  9.92148408355473e-06    0.52   1.13
1816  0.2937  9.91224714154608e-06    0.52   1.65
1817  0.3286  9.90301027441431e-06    0.52   1.33
1818  0.2368  9.89377349004096e-06    0.52   1.35
1819  0.6937  9.884536796307497e-06   0.52   1.07
1820  1.4493  9.875300201095312e-06   0.52   1.74
1821  0.5819  9.866063712285724e-06   0.52   1.35
tensor([[-2.7656, 0.4805, 2.3750, -0.7656, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:16,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.14 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9375, -4.0938, -1.1484, 1.0703, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5000, -3.9062, 0.6758, 0.2305, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4531, -2.8594, -1.6562, 2.1406, -0.4922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1562, -1.7344, 0.7383, 1.1406, -1.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -4.9688, -1.8750, 2.1250, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2188, -2.6719, 1.6094, 3.0000, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:18,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:29:18,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.53 | bwd_microstep: 1.90 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.12 [2025-11-06 18:29:18,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 494.69 | bwd: 2.75 | bwd_inner: 1.77 | bwd_allreduce: 0.86 | step: 2.20 52%|█████▏ | 1822/3507 [44:32<46:49, 1.67s/it] {'loss': 0.2901, 'learning_rate': 9.85682733775994e-06, 'epoch': 0.52} 52%|█████▏ | 1822/3507 [44:32<46:49, 1.67s/it]tensor([[-5.4375, -2.8594, 1.3047, 0.3750, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2031, -2.6719, -0.9805, 3.2188, 
-0.0967]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0000, -2.7969, 1.5078, 1.5938, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9062, -3.7344, 0.0977, 2.2500, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:18,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.50 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.1562, -3.0000, 0.3984, 2.0469, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5000, -2.0625, 1.5078, 0.5469, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9688, -3.3438, -0.0165, 2.7500, -1.9453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5625, -2.4062, 1.5156, 1.5312, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:18,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:29:18,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.12 | bwd_microstep: 1.93 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 1.47 [2025-11-06 18:29:18,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.65 | bwd: 2.71 | bwd_inner: 1.83 | bwd_allreduce: 0.76 | step: 1.54 52%|█████▏ | 1823/3507 [44:32<36:19, 1.29s/it] {'loss': 0.4432, 'learning_rate': 9.847591085399089e-06, 'epoch': 0.52} 52%|█████▏ | 1823/3507 [44:32<36:19, 1.29s/it]tensor([[-4.4375, -0.6992, 2.9375, -0.8164, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7656, -2.9375, -1.2578, 2.1562, -0.8281]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.2188, -3.2031, -0.0525, 1.8438, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5312, -4.3438, 1.0547, 1.5234, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:19,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 229.07 | bwd_microstep: 1.14 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-3.0625, -2.6562, 0.0903, 3.1562, -1.1484]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2812, -0.9219, 2.6875, -0.3379, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.8438, -4.8438, 0.5938, 1.8203, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1875, -1.7031, 2.9531, 0.0713, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:20,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.14 | optimizer_step: 0.18 [2025-11-06 18:29:20,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.07 | bwd_microstep: 1.90 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.32 [2025-11-06 18:29:20,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.16 | bwd: 3.04 | bwd_inner: 2.00 | bwd_allreduce: 0.87 | step: 2.43 52%|█████▏ | 1824/3507 [44:34<39:22, 1.40s/it] {'loss': 1.2098, 'learning_rate': 9.838354963084187e-06, 'epoch': 0.52} 52%|█████▏ | 1824/3507 [44:34<39:22, 1.40s/it]tensor([[-2.5469, 0.8477, 2.9531, -1.4766, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9688, -1.1172, 2.9531, 1.1016, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:0') [2025-11-06 18:29:20,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.43 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.1875, -4.4375, -0.6250, 2.3438, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9375, -2.6406, 1.2031, 0.8984, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9375, -1.4219, 1.9219, 0.9844, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0000, -3.2188, 0.7305, -0.5430, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2812, -2.3125, 1.2188, 1.2734, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4688, -3.0312, 2.4844, 0.2617, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:29:20,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:29:20,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.52 | bwd_microstep: 73.41 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 72.45 | step_microstep: 1.64 [2025-11-06 18:29:20,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.96 | bwd: 74.17 | bwd_inner: 1.55 | bwd_allreduce: 72.49 | step: 1.72 52%|█████▏ | 1825/3507 [44:34<31:14, 1.11s/it] {'loss': 0.3318, 'learning_rate': 9.829118978696136e-06, 'epoch': 0.52} 52%|█████▏ | 1825/3507 [44:34<31:14, 1.11s/it]tensor([[-5.1562, -1.8984, 1.3047, -1.5938, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7188, -4.0625, -1.8672, 2.4844, -1.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:21,025] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.45 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 tensor([[-5.5625, -5.2500, -1.7031, 2.3906, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.7500, 1.2109, 2.5000, -0.9062, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.8125, -1.2656, 2.4375, -1.0547, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5938, -3.1562, 1.9375, 1.7188, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1250, -4.6562, -1.2969, 2.0312, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1562, -4.6875, 0.1182, 2.2031, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:24,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.22 | optimizer_step: 0.27 [2025-11-06 18:29:24,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.05 | bwd_microstep: 2.48 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1.16 | step_microstep: 3.38 [2025-11-06 18:29:24,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.53 | bwd: 3.44 | bwd_inner: 2.00 | bwd_allreduce: 1.23 | step: 3.52 52%|█████▏ | 1826/3507 [44:38<53:07, 1.90s/it] {'loss': 0.3462, 'learning_rate': 9.819883140115722e-06, 'epoch': 0.52} 52%|█████▏ | 1826/3507 [44:38<53:07, 1.90s/it]tensor([[-3.4844, 0.2480, 3.0938, -1.4141, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8438, 0.3027, 2.2812, -1.0000, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.4688, -3.7812, 0.1167, 0.8398, -3.7031]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -4.4062, -0.6250, 3.3750, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:24,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.47 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.0000, 0.5586, 3.5156, -2.7188, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.4062, -3.6875, 1.5781, 0.8359, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3438, -3.7500, -0.9570, 3.8281, -0.8203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -1.9375, 2.1094, -0.2715, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:29:25,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:29:25,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 70.25 | bwd_microstep: 245.12 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 244.19 | step_microstep: 1.81 [2025-11-06 18:29:25,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 271.74 | bwd: 245.92 | bwd_inner: 1.56 | bwd_allreduce: 244.24 | step: 1.88 52%|█████▏ | 1827/3507 [44:38<41:46, 1.49s/it] {'loss': 0.4963, 'learning_rate': 9.810647455223615e-06, 'epoch': 0.52} 52%|█████▏ | 1827/3507 [44:38<41:46, 1.49s/it]tensor([[-4.9688, -2.2188, 2.1562, 1.0312, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5312, -1.9609, 1.4219, 0.0679, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3750, -4.5000, -0.4707, 2.4219, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:3') tensor([[-1.6562, 0.7266, 2.7031, 0.5508, -1.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:25,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.22 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-1.0078, 1.9297, 2.4219, -1.4219, -1.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.2812, -1.2969, 2.1094, 0.1367, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -2.5938, 1.2891, 2.2969, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.0938, 1.6016, 3.2031, -2.0781, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:28,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:29:28,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.98 | bwd_microstep: 1.94 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.31 [2025-11-06 18:29:28,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.19 | bwd: 2.76 | bwd_inner: 1.78 | bwd_allreduce: 0.84 | step: 2.40 52%|█████▏ | 1828/3507 [44:41<53:17, 1.90s/it] {'loss': 0.3202, 'learning_rate': 9.801411931900344e-06, 'epoch': 0.52} 52%|█████▏ | 1828/3507 [44:41<53:17, 1.90s/it]tensor([[-1.3203, 1.7656, 3.3281, -0.4355, -1.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.9688, -0.0452, 1.8281, -0.5273, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1562, -3.2188, 1.0781, 1.7188, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6250, 
-3.9219, 0.5703, 1.9062, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6562, 0.0542, 1.2422, -3.5625, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.9844, -2.8438, -0.5820, 0.3652, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:28,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 365.99 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.7812, -5.2812, -1.4375, 2.1250, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4219, 0.1040, 3.0312, -1.0469, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:28,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:29:28,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 67.26 | bwd_microstep: 2.10 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 0.86 | step_microstep: 1.46 [2025-11-06 18:29:28,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 433.27 | bwd: 2.96 | bwd_inner: 1.93 | bwd_allreduce: 0.90 | step: 1.55 52%|█████▏ | 1829/3507 [44:42<41:15, 1.48s/it] {'loss': 0.3879, 'learning_rate': 9.792176578026307e-06, 'epoch': 0.52} 52%|█████▏ | 1829/3507 [44:42<41:15, 1.48s/it]tensor([[-4.9688, -2.3906, 1.2969, -0.0189, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.9688, -5.4375, -0.2344, 2.1094, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0156, -3.1406, -0.4141, 3.6406, -0.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:28,675] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 178.53 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-0.4902, 2.7812, 2.7344, -2.0000, -1.5234]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-2.7812, -3.5000, -2.3750, 2.0156, -0.5234]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5312, -1.1328, 1.8750, 0.9531, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9688, -3.6875, -0.2578, -0.6367, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.5000, -2.1875, -2.1875, 0.8672, 0.0806]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:29:30,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:29:30,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.91 | bwd_microstep: 2.32 | bwd_inner_microstep: 1.36 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.56 [2025-11-06 18:29:30,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.41 | bwd: 3.32 | bwd_inner: 2.23 | bwd_allreduce: 0.91 | step: 2.66 52%|█████▏ | 1830/3507 [44:44<44:00, 1.57s/it] {'loss': 0.5404, 'learning_rate': 9.782941401481745e-06, 'epoch': 0.52} 52%|█████▏ | 1830/3507 [44:44<44:00, 1.57s/it]tensor([[-0.4297, 2.8125, 2.7500, -1.8906, -1.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9531, -1.6406, 2.1562, 1.5859, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.3125, -5.4062, -0.7695, 2.5938, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9375, -0.1738, 2.0469, 1.5000, -1.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:0') [2025-11-06 18:29:30,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.63 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.8438, -0.6172, 2.7969, -0.0835, -3.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3750, -3.6875, -0.4629, 2.5000, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -3.2188, 0.5859, 2.2344, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6562, -4.8438, -1.3359, 3.5469, -1.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:30,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:29:30,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.88 | bwd_microstep: 1.79 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.06 [2025-11-06 18:29:30,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 435.54 | bwd: 2.62 | bwd_inner: 1.63 | bwd_allreduce: 0.83 | step: 2.14 52%|█████▏ | 1831/3507 [44:44<34:49, 1.25s/it] {'loss': 0.9879, 'learning_rate': 9.773706410146764e-06, 'epoch': 0.52} 52%|█████▏ | 1831/3507 [44:44<34:49, 1.25s/it]tensor([[-4.2500, -1.9766, 0.8281, -0.3906, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8438, -3.4688, 1.6484, 1.8438, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1875, -2.7031, 1.4375, 0.8477, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:31,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.84 | bwd_microstep: 0.94 | bwd_inner_microstep: 
0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.7188, -4.0312, 1.5078, 1.1016, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6250, -0.6914, 3.6562, -0.4375, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0156, -1.4062, 1.4219, 4.0000, -0.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1562, -3.3438, 0.3945, 1.1094, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8281, -0.3809, 1.7344, -2.3750, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:29:33,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.53 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:29:33,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.07 | bwd_microstep: 2.53 | bwd_inner_microstep: 1.55 | bwd_allreduce_microstep: 0.83 | step_microstep: 3.94 [2025-11-06 18:29:33,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.93 | bwd: 3.43 | bwd_inner: 2.35 | bwd_allreduce: 0.87 | step: 4.02 52%|█████▏ | 1832/3507 [44:47<44:39, 1.60s/it] {'loss': 0.5852, 'learning_rate': 9.764471611901302e-06, 'epoch': 0.52} 52%|█████▏ | 1832/3507 [44:47<44:39, 1.60s/it]tensor([[-5.3125, -4.2188, -0.4453, 1.8125, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4062, -3.5000, -0.0698, 2.2344, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:33,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.56 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.0312, -4.2500, 0.0540, 1.0703, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:2') tensor([[-4.0000, -4.1562, -1.3750, 2.8906, -1.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6094, 0.1709, 2.2656, -0.2559, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2188, -2.9688, 0.7266, 2.5156, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4062, -2.9844, 0.6953, 2.0312, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9688, -4.0625, -1.3672, 2.6406, -1.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:29:33,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:29:33,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.54 | bwd_microstep: 29.67 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 28.80 | step_microstep: 2.25 [2025-11-06 18:29:33,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.13 | bwd: 30.57 | bwd_inner: 1.58 | bwd_allreduce: 28.84 | step: 2.33 52%|█████▏ | 1833/3507 [44:47<34:53, 1.25s/it] {'loss': 0.4381, 'learning_rate': 9.755237014625136e-06, 'epoch': 0.52} 52%|█████▏ | 1833/3507 [44:47<34:53, 1.25s/it]tensor([[-3.8594, -0.8164, 2.7500, 0.3184, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:33,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.15 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.3906, 1.2266, 2.2344, -0.3594, -1.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3594, -0.6992, 2.0781, 0.2930, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3906, -4.0938, 
-2.6562, 1.8203, -0.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4688, -4.0312, 0.4824, 2.3438, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7500, -1.7031, 3.1250, -0.9102, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6719, -4.2188, -2.2812, 2.1406, -1.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1875, -3.6719, 1.0469, 2.9375, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:36,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:29:36,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.66 | bwd_microstep: 2.02 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.19 [2025-11-06 18:29:36,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.84 | bwd: 2.73 | bwd_inner: 1.67 | bwd_allreduce: 0.92 | step: 2.27 52%|█████▏ | 1834/3507 [44:50<46:53, 1.68s/it] {'loss': 0.6161, 'learning_rate': 9.746002626197873e-06, 'epoch': 0.52} 52%|█████▏ | 1834/3507 [44:50<46:53, 1.68s/it]tensor([[-3.8750, -4.2188, -1.4766, 3.0469, -1.3516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6250, -2.0625, 1.0078, -0.5312, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:36,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.83 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.7500, -3.2656, 0.6250, 1.9141, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.6875, -6.2188, -2.2344, -0.1611, -5.0938]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.4375, -6.6562, -2.4375, 0.9414, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.5469, 0.6992, 2.9375, -2.7656, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8281, -4.3125, -1.8281, 2.9688, -1.2578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -3.3906, 0.6680, 2.0625, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:29:37,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:29:37,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.55 | bwd_microstep: 994.88 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 993.73 | step_microstep: 1.65 [2025-11-06 18:29:37,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.41 | bwd: 995.78 | bwd_inner: 1.87 | bwd_allreduce: 993.77 | step: 1.74 52%|█████▏ | 1835/3507 [44:51<44:04, 1.58s/it] {'loss': 0.7264, 'learning_rate': 9.736768454498935e-06, 'epoch': 0.52} 52%|█████▏ | 1835/3507 [44:51<44:04, 1.58s/it]tensor([[-3.3750, -0.9922, 2.0312, 0.5742, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:37,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 94.12 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.05 tensor([[-6.4375, -5.0625, -0.2910, 1.9297, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-9.1250, -6.3438, -1.9375, -2.8125, -7.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.1094, 1.3906, 3.3750, -1.2422, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:2') tensor([[-4.6875, -4.1562, -0.8438, 2.5156, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -4.0000, -0.3711, 2.3438, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -4.2500, -0.6953, 2.3906, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1562, -3.4844, 0.5000, 1.6484, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:38,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.22 | optimizer_step: 0.20 [2025-11-06 18:29:38,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.17 | bwd_microstep: 2.16 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.98 | step_microstep: 2.03 [2025-11-06 18:29:38,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 280.29 | bwd: 2.92 | bwd_inner: 1.77 | bwd_allreduce: 1.01 | step: 2.09 52%|█████▏ | 1836/3507 [44:52<40:58, 1.47s/it] {'loss': 0.3095, 'learning_rate': 9.727534507407563e-06, 'epoch': 0.52} 52%|█████▏ | 1836/3507 [44:52<40:58, 1.47s/it]tensor([[4.7812, 7.1562, 6.2188, 2.2812, 2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.5312, -5.0312, -1.3125, 1.9688, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.3438, -4.7812, -0.0664, 1.9609, -3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:39,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.06 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.1250, -5.0312, -0.7109, 2.1250, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.6875, 
-5.5938, -1.4141, 1.4844, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7812, -1.6250, 2.2812, -0.2148, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4062, -3.3906, 0.0234, 4.2812, -0.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-9.6250, -7.6875, -2.1406, -0.2520, -6.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:29:41,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.30 [2025-11-06 18:29:41,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.36 | bwd_microstep: 2523.99 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 2522.98 | step_microstep: 2.19 [2025-11-06 18:29:41,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 472.46 | bwd: 2524.86 | bwd_inner: 1.69 | bwd_allreduce: 2523.02 | step: 2.26 52%|█████▏ | 1837/3507 [44:55<54:02, 1.94s/it] {'loss': 0.3566, 'learning_rate': 9.718300792802808e-06, 'epoch': 0.52} 52%|█████▏ | 1837/3507 [44:55<54:02, 1.94s/it]tensor([[-3.8750, -0.4746, 0.7578, -3.1406, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.3359, 1.5703, 3.8594, 0.6641, -1.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7188, -1.7656, 1.2812, 0.7812, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -3.7031, -0.6836, 2.0938, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:42,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.11 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.8125, -2.0469, 2.0938, 0.6875, 
-3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3125, -1.3984, 1.9766, -0.0444, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.0000, 2.7969, 3.0938, -2.2812, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-6.3125, -3.1406, 2.0000, 0.3262, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:42,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.25 | optimizer_step: 0.28 [2025-11-06 18:29:42,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.83 | bwd_microstep: 2.16 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1.09 | step_microstep: 3.37 [2025-11-06 18:29:42,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 454.95 | bwd: 2.93 | bwd_inner: 1.58 | bwd_allreduce: 1.13 | step: 3.41 52%|█████▏ | 1838/3507 [44:56<45:31, 1.64s/it] {'loss': 0.308, 'learning_rate': 9.70906731856352e-06, 'epoch': 0.52} 52%|█████▏ | 1838/3507 [44:56<45:31, 1.64s/it]tensor([[-4.0938, -4.9062, -2.8906, 2.1875, -1.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9219, -1.2812, 2.2969, 0.7500, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6562, -2.0156, 1.5156, 0.1875, -3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1875, -0.6289, 2.9844, -0.8750, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:43,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.31 | bwd_microstep: 1.41 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.7969, -2.4531, 0.9141, 4.4062, -0.7109]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5625, -2.8438, 0.6484, 3.4375, -1.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1875, -4.1250, -2.3906, 2.7969, -0.6016]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0000, -1.2266, 1.6797, -2.5625, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:29:43,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:29:43,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.88 | bwd_microstep: 3.35 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 2.23 | step_microstep: 1.58 [2025-11-06 18:29:43,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.21 | bwd: 4.75 | bwd_inner: 2.37 | bwd_allreduce: 2.26 | step: 1.65 52%|█████▏ | 1839/3507 [44:57<35:35, 1.28s/it] {'loss': 0.2914, 'learning_rate': 9.69983409256835e-06, 'epoch': 0.52} 52%|█████▏ | 1839/3507 [44:57<35:35, 1.28s/it]tensor([[-5.0938, -3.6094, 0.3516, 1.4531, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -2.2031, 1.9688, 1.0234, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.0781, -2.0156, 1.3359, 5.4375, 0.0337]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.7188, -4.8125, 0.1152, 1.4688, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:43,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.00 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.17 tensor([[-3.8750, -0.5859, 2.1094, -0.8281, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') 
tensor([[-3.8125, -0.5078, 1.9922, -1.5000, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8125, -2.6562, 1.0000, 2.8281, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2344, -0.0698, 1.9531, -1.3906, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:43,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:29:43,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.97 | bwd_microstep: 1.70 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.71 | step_microstep: 1.47 [2025-11-06 18:29:43,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 400.99 | bwd: 2.66 | bwd_inner: 1.80 | bwd_allreduce: 0.74 | step: 1.65 52%|█████▏ | 1840/3507 [44:57<28:33, 1.03s/it] {'loss': 0.3, 'learning_rate': 9.690601122695727e-06, 'epoch': 0.52} 52%|█████▏ | 1840/3507 [44:57<28:33, 1.03s/it]tensor([[-4.0625, -0.2871, 2.7031, -1.9688, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.1406, -3.2812, -1.4062, 2.0625, -1.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-4.4688, -3.9375, -0.6484, 2.7500, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9062, -3.5781, -0.7812, 2.1250, -1.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6719, 0.5469, 3.0312, -0.6875, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:44,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 79.25 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.6250, -2.2344, 1.1562, 0.5000, 
-3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.1875, -1.4688, 1.0469, 2.9688, -0.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.7500, -6.4688, -1.0547, 1.7812, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:29:46,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.80 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:29:46,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.26 | bwd_microstep: 1722.19 | bwd_inner_microstep: 1.71 | bwd_allreduce_microstep: 1720.38 | step_microstep: 2.55 [2025-11-06 18:29:46,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 254.52 | bwd: 1723.08 | bwd_inner: 2.50 | bwd_allreduce: 1720.43 | step: 2.64 52%|█████▏ | 1841/3507 [45:00<46:04, 1.66s/it] {'loss': 0.9818, 'learning_rate': 9.681368416823869e-06, 'epoch': 0.52} 52%|█████▏ | 1841/3507 [45:00<46:04, 1.66s/it]tensor([[-5.5938, -3.6719, 0.5898, 1.0781, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:47,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.59 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.6250, -5.3438, -1.6641, 2.3594, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0469, -1.6328, 1.5469, 2.5938, -1.6797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6250, -4.3125, -1.4922, 1.6484, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4375, -3.2656, 0.9023, 3.2031, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4688, -2.9531, -0.5234, 1.8750, -1.6797]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4062, -1.6797, 2.7812, -0.7109, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0625, -6.3750, -1.5391, 2.4844, -3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:47,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:29:47,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 303.80 | bwd_microstep: 1.96 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 1.52 [2025-11-06 18:29:47,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 488.42 | bwd: 2.77 | bwd_inner: 1.87 | bwd_allreduce: 0.79 | step: 1.59 53%|█████▎ | 1842/3507 [45:01<36:41, 1.32s/it] {'loss': 0.2093, 'learning_rate': 9.672135982830761e-06, 'epoch': 0.53} 53%|█████▎ | 1842/3507 [45:01<36:41, 1.32s/it]tensor([[-4.6250, -2.4844, 1.1016, 0.8242, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:47,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.72 | bwd_microstep: 1.13 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.0312, -1.5781, 1.8984, -1.3594, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.7500, -3.8125, 1.7188, 0.7305, -5.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3125, -3.3750, 1.3281, 2.1250, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2969, -2.1875, 0.7500, 2.1562, -1.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8438, -2.6250, 2.5781, 0.5703, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:1') tensor([[-3.0312, 0.4805, 2.2344, -2.2500, -3.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-3.3438, -3.4844, -1.2578, 2.5000, -1.1953]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:29:50,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:29:50,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.84 | bwd_microstep: 1803.05 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 1802.04 | step_microstep: 1.80 [2025-11-06 18:29:50,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.58 | bwd: 1804.18 | bwd_inner: 1.94 | bwd_allreduce: 1802.08 | step: 1.88 53%|█████▎ | 1843/3507 [45:03<48:29, 1.75s/it] {'loss': 0.5878, 'learning_rate': 9.662903828594172e-06, 'epoch': 0.53} 53%|█████▎ | 1843/3507 [45:03<48:29, 1.75s/it]tensor([[-6.0938, -3.9688, 0.4727, 0.6914, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2188, -4.4375, 0.1641, 1.3828, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5938, -3.6719, 0.4062, 0.9219, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2188, -3.2656, 2.0312, 0.8086, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:50,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 300.06 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12 tensor([[-3.9062, -4.1875, -1.6328, 2.4688, -1.5547]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1562, -3.3281, 0.2812, 2.9844, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4062, 
-1.3750, 1.7578, -0.7344, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6406, -3.6562, -0.4180, 3.9375, -1.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:50,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.15 | optimizer_step: 0.25 [2025-11-06 18:29:50,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.21 | bwd_microstep: 1.99 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.90 | step_microstep: 1.80 [2025-11-06 18:29:50,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 505.31 | bwd: 3.08 | bwd_inner: 1.94 | bwd_allreduce: 0.96 | step: 1.91 53%|█████▎ | 1844/3507 [45:04<38:32, 1.39s/it] {'loss': 0.9939, 'learning_rate': 9.653671961991613e-06, 'epoch': 0.53} 53%|█████▎ | 1844/3507 [45:04<38:32, 1.39s/it]tensor([[-3.7344, -0.0564, 2.8438, -1.3828, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4062, -0.7227, 3.5312, 0.1895, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2969, -3.6875, -2.6562, 0.7266, -1.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:50,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.95 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[1.1250, 3.1406, 4.9062, 3.7188, 1.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4062, -3.0625, 0.8711, 2.7969, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4375, -3.1094, 1.2891, 3.3281, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.6094, -2.3438, 0.4512, 3.6719, -0.6953]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6562, -3.7031, 1.3984, 0.3574, -5.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:29:52,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 18:29:52,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 296.36 | bwd_microstep: 1552.58 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1551.39 | step_microstep: 2.13 [2025-11-06 18:29:52,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 471.34 | bwd: 1553.37 | bwd_inner: 1.80 | bwd_allreduce: 1551.42 | step: 2.20 53%|█████▎ | 1845/3507 [45:06<44:08, 1.59s/it] {'loss': 0.1694, 'learning_rate': 9.64444039090036e-06, 'epoch': 0.53} 53%|█████▎ | 1845/3507 [45:06<44:08, 1.59s/it]tensor([[-3.8906, -4.2500, -1.7812, 2.5781, -1.3984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -2.3906, 0.6523, -1.1953, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:52,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.80 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10 tensor([[-4.2188, -2.4688, 1.7109, 2.7344, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1250, 0.0356, 3.9375, -1.1875, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6562, -1.6328, 1.1016, 0.5195, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5938, -3.5625, -0.3223, 3.8125, -1.2109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7188, -4.5000, -0.4473, 1.6172, -3.4844]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3750, -1.7500, 1.3594, 1.8672, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:29:53,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:29:53,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.29 | bwd_microstep: 196.39 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 195.36 | step_microstep: 1.89 [2025-11-06 18:29:53,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.11 | bwd: 197.34 | bwd_inner: 1.80 | bwd_allreduce: 195.41 | step: 1.99 53%|█████▎ | 1846/3507 [45:07<35:31, 1.28s/it] {'loss': 0.5107, 'learning_rate': 9.63520912319744e-06, 'epoch': 0.53} 53%|█████▎ | 1846/3507 [45:07<35:31, 1.28s/it]tensor([[-5.4062, -3.1875, 1.2656, 1.3203, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9375, -2.5312, 1.6250, 1.0625, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9375, -3.8594, -0.0601, 2.0625, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5000, -2.0000, 2.7656, 0.0562, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:53,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 281.01 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-3.8594, -2.4219, 0.7969, 1.7969, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3594, 0.1338, 3.1719, -0.5117, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-7.9688, -6.2188, -0.9648, 0.9805, -5.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:1') tensor([[-5.3750, -4.4375, -0.4062, 2.1875, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:29:54,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:29:54,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 317.13 | bwd_microstep: 444.71 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 443.58 | step_microstep: 1.89 [2025-11-06 18:29:54,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 598.16 | bwd: 445.81 | bwd_inner: 2.05 | bwd_allreduce: 443.63 | step: 1.98 53%|█████▎ | 1847/3507 [45:08<33:54, 1.23s/it] {'loss': 0.9991, 'learning_rate': 9.625978166759612e-06, 'epoch': 0.53} 53%|█████▎ | 1847/3507 [45:08<33:54, 1.23s/it]tensor([[-4.6250, -0.4824, 3.3906, -1.4688, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:54,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.63 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-3.7656, -0.5117, 2.9688, -0.0505, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8750, -4.4375, 0.4434, 2.5781, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.9141, 2.2656, 2.5156, -2.0000, -1.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.6250, -0.7461, 1.1016, -3.8281, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8125, -1.9141, 2.7969, -0.8281, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2188, -4.8750, -2.3438, 2.6094, -1.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') 
tensor([[-4.9062, -2.4219, 0.8203, -0.4258, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:29:55,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:29:55,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.10 | bwd_microstep: 251.28 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 250.12 | step_microstep: 1.52 [2025-11-06 18:29:55,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 273.74 | bwd: 252.25 | bwd_inner: 1.92 | bwd_allreduce: 250.16 | step: 1.62 53%|█████▎ | 1848/3507 [45:08<29:26, 1.06s/it] {'loss': 0.1983, 'learning_rate': 9.616747529463372e-06, 'epoch': 0.53} 53%|█████▎ | 1848/3507 [45:08<29:26, 1.06s/it]tensor([[-5.0000, -4.0312, 0.0571, 2.9844, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.9688, -6.4062, -2.6250, 0.6562, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -0.2363, 3.3906, -2.5469, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9375, -2.5156, 1.8828, 3.8125, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1562, -2.6250, 0.7656, 1.6484, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:55,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.19 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.9375, -2.8594, 2.2344, 0.5352, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9062, -2.0625, 0.8555, 0.7188, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -2.6406, 1.1172, 
0.7109, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:29:57,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:29:57,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 320.08 | bwd_microstep: 1313.43 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1312.37 | step_microstep: 1.71 [2025-11-06 18:29:57,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 482.28 | bwd: 1314.37 | bwd_inner: 1.84 | bwd_allreduce: 1312.40 | step: 1.79 53%|█████▎ | 1849/3507 [45:11<38:24, 1.39s/it] {'loss': 0.269, 'learning_rate': 9.607517219184951e-06, 'epoch': 0.53} 53%|█████▎ | 1849/3507 [45:11<38:24, 1.39s/it]tensor([[-2.4844, 0.4453, 3.1094, 0.1279, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:57,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.71 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.4219, -3.6875, -1.5078, 2.2656, -1.2266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8125, 0.0664, 3.8594, -0.4492, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7188, -1.6953, 1.5391, -0.9141, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.7188, -3.4375, 2.1719, 0.4297, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7969, -2.6094, -0.0082, 3.0000, -0.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0000, -3.6250, 1.4453, 1.5469, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8125, 1.4609, 4.6562, -0.7305, -3.3125]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:29:57,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:29:57,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 221.18 | bwd_microstep: 2.34 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 1.00 | step_microstep: 2.07 [2025-11-06 18:29:57,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.93 | bwd: 3.22 | bwd_inner: 2.03 | bwd_allreduce: 1.03 | step: 2.14 53%|█████▎ | 1850/3507 [45:11<31:45, 1.15s/it] {'loss': 0.1587, 'learning_rate': 9.598287243800292e-06, 'epoch': 0.53} 53%|█████▎ | 1850/3507 [45:11<31:45, 1.15s/it]tensor([[-5.5312, -4.4062, -0.0483, 2.4375, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8906, -3.7188, -1.1562, 2.1719, -1.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.0312, -4.4688, -2.4688, 1.7344, -1.5391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5938, -5.1875, -1.8125, 1.3984, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9219, -2.9062, 0.1934, 2.0469, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5625, -3.9688, 0.2158, 1.7812, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2188, -3.2969, 0.4941, 2.9062, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:58,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.67 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-7.2500, -4.5312, 0.1367, -0.5430, -5.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:0') [2025-11-06 18:29:58,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.27 | optimizer_step: 0.27 [2025-11-06 18:29:58,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.79 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.91 | step_microstep: 2.30 [2025-11-06 18:29:58,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 380.48 | bwd: 2.85 | bwd_inner: 1.74 | bwd_allreduce: 0.95 | step: 2.41 53%|█████▎ | 1851/3507 [45:12<29:26, 1.07s/it] {'loss': 0.6279, 'learning_rate': 9.589057611185058e-06, 'epoch': 0.53} 53%|█████▎ | 1851/3507 [45:12<29:26, 1.07s/it]tensor([[-2.4688, -3.5000, -2.9062, 1.6328, -0.2295]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9688, -1.1094, 3.2500, -0.6797, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5000, -4.0938, 1.2891, 1.2422, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.3125, -4.2500, 0.9844, 1.9375, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:29:59,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 233.36 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.19 tensor([[-3.1250, -0.8672, 2.5625, 1.9844, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.0000, -4.7812, 0.0071, 2.5938, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.7188, -4.4062, 1.0703, 1.5625, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5000, -4.9375, -2.4688, 1.9297, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:30:00,640] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.16 | optimizer_step: 0.20
[2025-11-06 18:30:00,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.11 | bwd_microstep: 421.70 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 420.66 | step_microstep: 2.45
[2025-11-06 18:30:00,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 424.47 | bwd: 422.76 | bwd_inner: 1.87 | bwd_allreduce: 420.71 | step: 2.65
53%|█████▎ | 1852/3507 [45:14<36:34, 1.33s/it] {'loss': 0.6071, 'learning_rate': 9.57982832921461e-06, 'epoch': 0.53}
tensor([[-2.6875, 0.5312, 2.3594, -1.2344, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.6875, -3.9531, 0.5234, -0.2451, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4375, -3.6562, 0.0815, 2.9844, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:00,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.06 | bwd_microstep: 0.63 | bwd_inner_microstep: 0.53 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.7969, -3.3125, -1.5234, 2.5469, -0.6602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5312, -2.2812, 1.3750, 0.9102, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.0312, -3.7969, 2.0469, 0.5273, -5.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9688, -2.4688, 0.9258, -0.3125, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9844, -3.2031, 0.2520, 2.9219, -1.9609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:30:01,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.28 | optimizer_step: 0.35
[2025-11-06 18:30:01,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.38 | bwd_microstep: 956.67 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 955.83 | step_microstep: 23.63
[2025-11-06 18:30:01,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.45 | bwd: 957.30 | bwd_inner: 1.29 | bwd_allreduce: 955.88 | step: 23.70
53%|█████▎ | 1853/3507 [45:15<36:33, 1.33s/it] {'loss': 0.4729, 'learning_rate': 9.570599405764023e-06, 'epoch': 0.53}
tensor([[-1.7031, 2.7031, 3.7656, -2.7031, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7500, -4.1562, -1.8672, 2.5469, -1.2891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:02,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.61 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-2.4688, 1.4531, 3.6250, -1.4922, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0000, -2.4844, 1.1875, 2.2656, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2500, -0.8320, 1.7969, 0.3281, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.6562, -3.4844, 0.8398, 1.1719, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.1875, -3.2969, 0.2334, 2.8125, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6250, -4.0625, -0.3828, 2.7344, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:04,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.23 | optimizer_step: 0.21
[2025-11-06 18:30:04,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.15 | bwd_microstep: 1.93 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.92 | step_microstep: 2.38
[2025-11-06 18:30:04,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.78 | bwd: 2.87 | bwd_inner: 1.71 | bwd_allreduce: 0.98 | step: 2.49
53%|█████▎ | 1854/3507 [45:18<44:26, 1.61s/it] {'loss': 0.6719, 'learning_rate': 9.561370848708061e-06, 'epoch': 0.53}
tensor([[-2.8906, 0.3457, 3.0156, -0.0242, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.3125, -1.9688, -1.4375, 2.0312, 0.3574]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:04,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.64 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-1.6641, 1.8359, 2.1406, -2.6406, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.3438, -3.9844, -0.1514, 1.4375, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7344, -2.0625, 1.5156, 2.3438, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.4375, -4.1875, 1.1250, 1.5781, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.4375, -5.4375, -1.2266, 1.5469, -3.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0625, -1.3281, 1.0938, -1.1562, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:30:06,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:30:06,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.00 | bwd_microstep: 1924.36 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 1923.47 | step_microstep: 2.30
[2025-11-06 18:30:06,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 464.67 | bwd: 1925.13 | bwd_inner: 1.48 | bwd_allreduce: 1923.51 | step: 2.39
53%|█████▎ | 1855/3507 [45:20<51:13, 1.86s/it] {'loss': 0.4246, 'learning_rate': 9.552142665921172e-06, 'epoch': 0.53}
tensor([[-5.6562, -3.2031, 0.1113, -1.0234, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:30:06,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.96 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-5.1562, -3.7812, 0.1245, 1.7812, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.9062, -3.7500, 0.3750, 0.1514, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.2812, -4.7188, -1.1172, 2.0312, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.3750, -4.7188, 0.3555, 2.2031, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.3750, -2.4688, 1.8984, -2.0312, -5.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5000, -5.1875, -1.4531, 2.3125, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6406, -3.1406, 0.3789, 3.5469, -1.5234]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:30:07,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.24 | optimizer_step: 0.21
[2025-11-06 18:30:07,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.36 | bwd_microstep: 538.49 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 537.22 | step_microstep: 2.23
[2025-11-06 18:30:07,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.34 | bwd: 539.49 | bwd_inner: 2.07 | bwd_allreduce: 537.28 | step: 2.32
53%|█████▎ | 1856/3507 [45:21<43:11, 1.57s/it] {'loss': 0.1767, 'learning_rate': 9.542914865277488e-06, 'epoch': 0.53}
tensor([[-1.0703, 2.2812, 2.8906, -1.5078, -1.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.9062, -4.3125, 0.5781, 0.0972, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:30:07,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.82 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-4.5625, -3.5312, 0.5117, 2.9375, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3438, -3.4219, 1.1562, 1.7500, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.3047, 1.8281, 2.5469, -1.6953, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.5938, -3.9688, -0.1123, 2.9219, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.0312, -3.2656, 1.8281, 0.7266, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.2344, -0.0605, 1.4844, -0.1689, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:30:08,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:30:08,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 137.71 | bwd_microstep: 755.62 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 754.57 | step_microstep: 1.88
[2025-11-06 18:30:08,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.55 | bwd: 756.69 | bwd_inner: 1.87 | bwd_allreduce: 754.63 | step: 1.99
53%|█████▎ | 1857/3507 [45:22<39:23, 1.43s/it] {'loss': 0.7098, 'learning_rate': 9.533687454650816e-06, 'epoch': 0.53}
tensor([[-3.7812, -1.4219, 1.3203, 0.3867, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3438, -1.6328, 1.9609, 0.1934, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.3750, -4.9062, -0.2793, 1.6406, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:08,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.15 | bwd_microstep: 1.12 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13
tensor([[-5.0938, -2.2031, 1.1953, -0.9883, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.9219, 0.1021, 2.3125, -0.7031, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.2812, -2.8750, 2.0625, -0.3223, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.8750, -4.4688, 0.6914, 0.6250, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.4531, -2.9062, -1.0703, 2.9062, -0.3613]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:30:09,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.18 | optimizer_step: 0.22
[2025-11-06 18:30:09,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.80 | bwd_microstep: 511.85 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 510.68 | step_microstep: 1.88
[2025-11-06 18:30:09,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 464.98 | bwd: 512.98 | bwd_inner: 2.07 | bwd_allreduce: 510.75 | step: 2.00
53%|█████▎ | 1858/3507 [45:23<36:16, 1.32s/it] {'loss': 0.6204, 'learning_rate': 9.524460441914621e-06, 'epoch': 0.53}
tensor([[-2.5469, 0.6016, 2.9844, -0.4277, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:30:09,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.55 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.7500, -4.1250, 1.4062, 1.2656, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.6562, -2.4219, -1.5547, 2.4688, 0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.4688, 0.3262, 2.6875, -1.8750, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.2969, 1.1875, 2.7031, 0.7773, -1.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-4.1562, -2.0625, 1.0703, 0.9258, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3125, -3.3594, 1.6797, 2.3750, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8906, 0.5977, 3.8750, 0.2256, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:30:11,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.22 | optimizer_step: 0.30
[2025-11-06 18:30:11,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.58 | bwd_microstep: 1932.28 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 1931.23 | step_microstep: 4.03
[2025-11-06 18:30:11,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 260.16 | bwd: 1933.19 | bwd_inner: 1.78 | bwd_allreduce: 1931.28 | step: 4.11
53%|█████▎ | 1859/3507 [45:25<43:43, 1.59s/it] {'loss': 0.516, 'learning_rate': 9.515233834942042e-06, 'epoch': 0.53}
tensor([[-2.8438, -3.6719, -2.4844, 1.8516, -0.6016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.3125, -0.5469, 2.6875, 2.7031, -1.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:12,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.88 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-8.8125, -6.5625, -0.9727, -0.1533, -6.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.0000, -1.8906, 1.7031, 3.7812, -1.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.3438, -0.1484, 2.0312, -1.5703, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.4062, -4.4062, -0.5703, 1.9062, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.7188, -5.6875, -2.2188, 2.0781, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2500, -3.0469, 1.0703, 3.3125, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:30:12,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.31 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:30:12,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.08 | bwd_microstep: 362.87 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 361.76 | step_microstep: 3.69
[2025-11-06 18:30:12,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.98 | bwd: 363.77 | bwd_inner: 1.81 | bwd_allreduce: 361.80 | step: 3.77
53%|█████▎ | 1860/3507 [45:26<36:46, 1.34s/it] {'loss': 0.5183, 'learning_rate': 9.506007641605866e-06, 'epoch': 0.53}
tensor([[-4.3125, -3.6562, -0.9805, 1.3828, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:12,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.71 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-3.6562, -3.5469, -0.1641, 3.8125, -1.2891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.8906, -3.0625, -0.3496, 3.7344, -0.6523]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.5938, -1.9141, 1.3828, 4.0312, -0.8164]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-0.6367, 2.0312, 3.0938, 0.3242, -0.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.0156, -0.5664, 2.5000, 1.0938, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8438, -3.5938, 0.2852, 2.0781, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5938, -0.2266, 3.2656, -2.4688, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:30:14,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.28 | optimizer_step: 0.29
[2025-11-06 18:30:14,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.04 | bwd_microstep: 1390.43 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1389.20 | step_microstep: 2.86
[2025-11-06 18:30:14,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 305.77 | bwd: 1391.39 | bwd_inner: 1.96 | bwd_allreduce: 1389.26 | step: 2.96
53%|█████▎ | 1861/3507 [45:28<40:07, 1.46s/it] {'loss': 0.3122, 'learning_rate': 9.496781869778521e-06, 'epoch': 0.53}
tensor([[-8.3750, -5.8125, -0.6562, -0.9570, -6.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2188, -2.0312, 1.8750, 1.6719, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0625, -3.4688, 0.1963, 1.2578, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:14,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.91 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.7812, -1.5859, 1.6016, 1.1016, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0625, -3.4688, 1.0469, 2.4844, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5938, -3.5000, 1.2656, -0.4902, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.5938, -1.6094, 3.1250, -1.0938, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3750, -4.1250, -1.3906, 1.8672, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:30:15,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:30:15,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.05 | bwd_microstep: 687.79 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 686.74 | step_microstep: 2.49
[2025-11-06 18:30:15,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.00 | bwd: 688.62 | bwd_inner: 1.67 | bwd_allreduce: 686.78 | step: 2.58
53%|█████▎ | 1862/3507 [45:29<37:17, 1.36s/it] {'loss': 0.7072, 'learning_rate': 9.48755652733209e-06, 'epoch': 0.53}
tensor([[-3.2031, -3.3594, -1.2969, 2.3750, -1.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6562, -4.6250, -0.1074, 2.4844, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:15,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.18 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-5.1562, -3.0781, 0.7734, 0.9844, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1250, -1.3359, 3.0781, -0.7383, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.2812, -0.0918, 3.0312, 0.0889, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.7500, 2.2031, 3.1562, -2.2812, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.7812, -4.1562, 0.4766, -0.1543, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.7188, -4.3125, 1.1562, -0.7422, -6.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:30:16,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.25 | optimizer_step: 0.20
[2025-11-06 18:30:16,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.43 | bwd_microstep: 722.60 | bwd_inner_microstep: 5.97 | bwd_allreduce_microstep: 716.52 | step_microstep: 2.34
[2025-11-06 18:30:16,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.64 | bwd: 723.65 | bwd_inner: 6.89 | bwd_allreduce: 716.58 | step: 2.44
53%|█████▎ | 1863/3507 [45:30<35:10, 1.28s/it] {'loss': 0.3301, 'learning_rate': 9.47833162213827e-06, 'epoch': 0.53}
tensor([[ 0.0043, 2.5312, 1.7812, -1.2109, -0.6953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:30:16,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 114.33 | bwd_microstep: 6.38 | bwd_inner_microstep: 6.25 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.7812, -4.1250, -0.4785, 2.6094, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8750, -2.9219, 0.8516, 1.1562, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.8750, -2.8281, 0.9727, -1.1953, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.1250, -2.2031, 1.2891, 3.6875, -1.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.2656, -0.3359, 2.4844, -0.3008, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.4688, -2.9531, 0.3398, 1.0469, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2188, -1.0000, 2.9219, 0.1660, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:30:19,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.15 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:30:19,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.98 | bwd_microstep: 1307.46 | bwd_inner_microstep: 2.53 | bwd_allreduce_microstep: 1304.83 | step_microstep: 3.39
[2025-11-06 18:30:19,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 265.32 | bwd: 1313.83 | bwd_inner: 8.79 | bwd_allreduce: 1304.89 | step: 3.48
53%|█████▎ | 1864/3507 [45:33<45:37, 1.67s/it] {'loss': 0.5457, 'learning_rate': 9.469107162068399e-06, 'epoch': 0.53}
tensor([[-3.8281, 0.1172, 2.8281, -2.0938, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.2656, -2.5312, 0.3477, 2.5938, -1.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:19,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.96 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-1.9922, 1.5781, 2.7656, -1.7109, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.3750, -4.9688, -0.9258, 2.7344, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8750, -4.4375, -0.4785, 3.1094, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.3125, -1.6719, 2.6406, -0.7812, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.2500, -4.0938, -0.4648, 3.5469, -1.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0938, -0.7383, 3.5469, -1.7734, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:30:20,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.24 | optimizer_step: 0.26
[2025-11-06 18:30:20,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.21 | bwd_microstep: 1028.20 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 1027.32 | step_microstep: 2.46
[2025-11-06 18:30:20,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.20 | bwd: 1029.20 | bwd_inner: 1.63 | bwd_allreduce: 1027.39 | step: 2.55
53%|█████▎ | 1865/3507 [45:34<43:30, 1.59s/it] {'loss': 0.0791, 'learning_rate': 9.459883154993435e-06, 'epoch': 0.53}
tensor([[-5.1250, -4.1875, -0.0903, 2.4531, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.3438, -1.1641, 2.5625, 2.4688, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0312, -3.4219, 0.4004, 1.4219, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:20,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.88 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-3.8281, -2.2031, 1.5781, 2.6406, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.8438, -5.0625, -1.4297, 1.2344, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5938, -3.6719, -0.1118, 1.9609, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9062, -4.3438, -0.8086, 2.6094, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.2500, -5.5312, -1.2891, 2.1406, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:30:23,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.17 | optimizer_step: 0.21
[2025-11-06 18:30:23,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.32 | bwd_microstep: 714.71 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 713.71 | step_microstep: 2.92
[2025-11-06 18:30:23,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.21 | bwd: 715.56 | bwd_inner: 1.62 | bwd_allreduce: 713.76 | step: 3.03
53%|█████▎ | 1866/3507 [45:37<51:44, 1.89s/it] {'loss': 0.341, 'learning_rate': 9.450659608783945e-06, 'epoch': 0.53}
tensor([[-7.5000, -4.4688, -1.5078, -3.5781, -6.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.7656, -3.3438, -1.1328, 3.4844, -0.4473]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:23,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.61 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.9688, -5.3750, -2.4062, 2.4844, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2500, -1.9141, 2.3906, -0.3730, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.8125, -3.8438, 0.4141, 0.8711, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.4062, -0.7930, 3.7031, -2.0625, -5.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.8750, -3.9844, 0.5547, 1.3594, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8750, -3.8750, -0.1680, 4.2812, -1.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:30:24,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.20 | optimizer_step: 0.25
[2025-11-06 18:30:24,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.17 | bwd_microstep: 750.08 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 749.16 | step_microstep: 2.27
[2025-11-06 18:30:24,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 427.81 | bwd: 750.85 | bwd_inner: 1.49 | bwd_allreduce: 749.21 | step: 2.35
53%|█████▎ | 1867/3507 [45:38<46:12, 1.69s/it] {'loss': 1.0575, 'learning_rate': 9.4414365313101e-06, 'epoch': 0.53}
tensor([[-5.1875, -4.9062, -1.5703, 1.8906, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.8125, -4.9062, -0.5117, 2.6250, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.6406, -3.2031, -1.4688, 2.7344, -0.4336]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:24,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.22 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.22
tensor([[-7.9688, -7.0000, -1.9297, 1.6250, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8125, -3.0781, 0.0598, 0.6094, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7031, -2.9375, 0.1875, 2.5312, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8125, -4.4375, -2.4375, 2.1094, -1.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-8.0000, -5.7188, -0.1348, 0.5508, -5.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:30:26,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 9.84 | optimizer_gradients: 0.21 | optimizer_step: 0.20
[2025-11-06 18:30:26,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.58 | bwd_microstep: 118.62 | bwd_inner_microstep: 1.29 | bwd_allreduce_microstep: 117.22 | step_microstep: 12.34
[2025-11-06 18:30:26,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.83 | bwd: 119.52 | bwd_inner: 2.06 | bwd_allreduce: 117.28 | step: 12.57
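Aside: the per-step `{'loss': ..., 'learning_rate': ..., 'epoch': ...}` entries above are the Hugging Face Trainer's progress dicts, and they can be scraped from a saved console log to plot the loss curve. A minimal sketch, assuming only the line shape shown in this log (the regex and function name are illustrative, not part of the training code):

```python
import re

# Matches HF Trainer progress dicts like:
#   {'loss': 0.6071, 'learning_rate': 9.57982832921461e-06, 'epoch': 0.53}
LOSS_RE = re.compile(r"\{'loss': ([0-9.]+), 'learning_rate': ([0-9.eE+-]+),")

def extract_loss_lr(log_text):
    """Return (loss, learning_rate) pairs found in a training-log string."""
    return [(float(m.group(1)), float(m.group(2)))
            for m in LOSS_RE.finditer(log_text)]

sample = "{'loss': 0.6071, 'learning_rate': 9.57982832921461e-06, 'epoch': 0.53}"
print(extract_loss_lr(sample))  # -> [(0.6071, 9.57982832921461e-06)]
```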
53%|█████▎ | 1868/3507 [45:40<46:29, 1.70s/it] {'loss': 0.3086, 'learning_rate': 9.43221393044168e-06, 'epoch': 0.53}
tensor([[-4.9375, -5.4375, -2.7656, 2.3438, -1.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.7969, -3.5625, -1.9297, 2.6094, -0.4766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.4062, 1.5156, 4.1875, -1.1641, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.6875, 0.7148, 2.9688, -1.2500, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0000, -4.1562, 0.7852, 1.9688, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:26,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 278.17 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-5.4375, -4.5625, 0.0471, 3.0938, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8438, -0.2734, 0.8320, -3.7812, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6875, -3.9844, -0.1523, 2.7656, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:30:26,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.86 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:30:26,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.60 | bwd_microstep: 1.64 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.79
[2025-11-06 18:30:26,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 419.79 | bwd: 2.60 | bwd_inner: 1.61 | bwd_allreduce: 0.84 | step: 2.89
53%|█████▎ | 1869/3507 [45:40<36:23, 1.33s/it] {'loss': 0.1198, 'learning_rate': 9.422991814048051e-06, 'epoch': 0.53}
tensor([[-5.9688, -4.0312, 1.3672, 2.7812, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:30:26,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.82 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.1250, -3.8281, -1.2656, 2.1250, -1.8828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7656, -4.1562, -1.7031, 2.6250, -1.3359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.7070, 2.3750, 2.9375, -1.3750, -1.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.3438, -1.7891, 2.6250, -0.4941, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1562, -2.4062, 1.1875, 1.9609, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6719, -0.1562, 2.3438, -1.5625, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.3906, 0.2314, 2.3594, -2.0156, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:30:29,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 18:30:29,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 120.40 | bwd_microstep: 1.67 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.72 | step_microstep: 2.15
[2025-11-06 18:30:29,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 260.22 | bwd: 2.59 | bwd_inner: 1.71 | bwd_allreduce: 0.76 | step: 2.24
53%|█████▎ | 1870/3507 [45:43<51:33, 1.89s/it] {'loss': 1.4348, 'learning_rate': 9.413770189998165e-06, 'epoch': 0.53}
tensor([[-6.0938, -2.5156, 1.2344, -1.9844, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.2656, -2.9688, 0.0811, 3.5312, -1.1172]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5156, -1.0156, 2.0469, 0.2412, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:30:30,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.07 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.2500, -0.8984, 1.3203, 1.9688, -1.1797]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5000, -2.1562, 2.2969, -0.0645, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5938, 0.2539, 3.4219, -0.9375, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1562, -4.6875, -0.3457, 3.6719, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.3750, -3.1875, -0.2383, 3.2969, -1.2578]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:30:30,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:30:30,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.54 | bwd_microstep: 18.62 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 17.64 | step_microstep: 1.76
[2025-11-06 18:30:30,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.63 | bwd: 19.47 | bwd_inner: 1.66 | bwd_allreduce: 17.68 | step: 1.83
53%|█████▎ | 1871/3507 [45:44<39:16, 1.44s/it] {'loss': 0.1273, 'learning_rate': 9.40454906616056e-06, 'epoch': 0.53}
tensor([[-4.7500, -4.5312, -0.8789, 2.8281, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2812, -3.4062, 0.3516, 3.1094, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5625, -1.6875, 1.8516, -0.1865, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:30:30,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.80 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-6.0000, -3.8906, 0.7969, 1.0469, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.9062, -4.4062, -2.5625, 1.6953, -1.4453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.5938, -5.0938, -0.8086, 0.8672, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.9375, -4.1250, 0.3984, 1.1562, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.2500, -6.3125, -2.8438, 1.8672, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:30:32,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.19 | optimizer_step: 0.17
[2025-11-06 18:30:32,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.36 | bwd_microstep: 716.06 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 715.08 | step_microstep: 1.94
[2025-11-06 18:30:32,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.20 | bwd: 717.11 | bwd_inner: 1.77 | bwd_allreduce: 715.13 | step: 2.04
53%|█████▎ | 1872/3507 [45:46<47:03, 1.73s/it] {'loss': 0.1854, 'learning_rate': 9.395328450403342e-06, 'epoch': 0.53}
53%|█████▎ | 1872/3507 [45:46<47:03,
1.73s/it]tensor([[-3.3438, -0.3848, 2.4688, 0.2793, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:30:32,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.69 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.3438, -2.7344, 1.7969, 0.4043, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.4219, 1.9609, 2.8594, 0.5352, -0.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.1250, -4.6875, 0.8555, 1.1328, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.2109, -1.7656, -1.7422, 0.9336, 0.2061]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-3.7812, -0.9023, 2.2344, 0.0249, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7031, -1.3047, 3.3594, 2.9844, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0625, -4.3125, -1.3594, 2.7969, -1.6641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:30:33,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:30:33,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.57 | bwd_microstep: 130.02 | bwd_inner_microstep: 5.09 | bwd_allreduce_microstep: 124.84 | step_microstep: 1.76 [2025-11-06 18:30:33,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.27 | bwd: 130.72 | bwd_inner: 5.69 | bwd_allreduce: 124.88 | step: 1.85 53%|█████▎ | 1873/3507 [45:46<36:47, 1.35s/it] {'loss': 0.4726, 'learning_rate': 9.38610835059419e-06, 'epoch': 0.53} 53%|█████▎ | 1873/3507 [45:46<36:47, 1.35s/it]tensor([[-2.7812, 1.1328, 3.2812,
-2.1250, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:30:33,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.25 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.1562, -3.5781, -0.5352, 1.9766, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -1.7500, 1.9766, 0.2139, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.2500, -4.2500, 0.9648, 2.1094, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2500, -1.5312, 3.0312, -0.7031, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.9375, -6.5312, -2.4844, 1.1953, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.0000, -3.4844, -0.9141, 3.7344, -0.5898]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6562, -2.9688, 0.7812, 1.7656, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:30:35,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.22 | optimizer_step: 0.24 [2025-11-06 18:30:35,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.81 | bwd_microstep: 2051.37 | bwd_inner_microstep: 5.74 | bwd_allreduce_microstep: 2045.52 | step_microstep: 2.91 [2025-11-06 18:30:35,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 259.08 | bwd: 2052.08 | bwd_inner: 6.34 | bwd_allreduce: 2045.58 | step: 2.99 53%|█████▎ | 1874/3507 [45:49<44:54, 1.65s/it] {'loss': 0.2803, 'learning_rate': 9.37688877460033e-06, 'epoch': 0.53} 53%|█████▎ | 1874/3507 [45:49<44:54, 1.65s/it]tensor([[-5.0938, -4.8125, -1.0156, 2.8594, -2.4531]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5000, -2.4688, 1.1016, 0.8711, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1562, -4.0938, 0.1807, 2.8125, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0312, -4.5938, 0.1709, 2.2812, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:30:35,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.61 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-9.0625, -7.4688, -2.1094, -0.0131, -6.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.4062e+00, -4.3125e+00, -4.5776e-03, 1.4551e-01, -4.5625e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9062, -2.9375, 0.6641, 0.8711, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2188, -2.6719, 1.1641, 2.2969, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:30:35,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.20 | optimizer_step: 0.31 [2025-11-06 18:30:35,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.89 | bwd_microstep: 2.56 | bwd_inner_microstep: 1.32 | bwd_allreduce_microstep: 1.12 | step_microstep: 2.11 [2025-11-06 18:30:35,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.52 | bwd: 3.55 | bwd_inner: 2.19 | bwd_allreduce: 1.17 | step: 2.21 53%|█████▎ | 1875/3507 [45:49<34:42, 1.28s/it] {'loss': 0.3841, 'learning_rate': 9.367669730288555e-06, 'epoch': 0.53} 53%|█████▎ | 1875/3507 [45:49<34:42, 1.28s/it]tensor([[-5.0938, -4.4062, -0.7891, 2.0938, -2.7969]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -0.4707, 2.6875, -1.6484, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:30:36,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.27 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.08 tensor([[-3.3906, 0.4336, 3.2031, -1.7656, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3750, -2.1875, 2.4688, 0.1191, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3750, -3.6406, 0.2734, 3.2500, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4062, -2.1562, 1.4141, 0.7031, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7500, -3.4219, 0.8164, 0.5117, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -2.2812, 1.8984, 1.1797, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:30:38,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.23 | optimizer_step: 0.27 [2025-11-06 18:30:38,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.07 | bwd_microstep: 2088.13 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 2087.18 | step_microstep: 2.48 [2025-11-06 18:30:38,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.34 | bwd: 2088.93 | bwd_inner: 1.57 | bwd_allreduce: 2087.22 | step: 2.56 53%|█████▎ | 1876/3507 [45:52<44:48, 1.65s/it] {'loss': 0.2185, 'learning_rate': 9.358451225525197e-06, 'epoch': 0.53} 53%|█████▎ | 1876/3507 [45:52<44:48, 1.65s/it]tensor([[-1.6875, 1.9141, 2.6250, -2.6719, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([1], device='cuda:3') tensor([[-5.6875, -4.8438, -0.1787, 3.0938, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3750, -5.5625, -2.7031, 1.6406, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.9062, -5.8438, -1.3750, 1.4922, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:30:38,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.65 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-7.5625, -4.2188, 0.5938, -1.6719, -6.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.1250, -4.3438, 0.7773, 0.1206, -5.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7500, -1.2656, 3.4375, 0.4746, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.3125, -3.4531, 1.5547, 0.3125, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:30:38,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:30:38,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.48 | bwd_microstep: 96.64 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 95.63 | step_microstep: 1.54 [2025-11-06 18:30:38,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.13 | bwd: 97.43 | bwd_inner: 1.60 | bwd_allreduce: 95.68 | step: 1.63 54%|█████▎ | 1877/3507 [45:52<35:25, 1.30s/it] {'loss': 0.3447, 'learning_rate': 9.349233268176127e-06, 'epoch': 0.54} 54%|█████▎ | 1877/3507 [45:52<35:25, 1.30s/it]tensor([[-4.4688, -5.0312, -2.3281, 2.6562, -1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') 
tensor([[-4.1250, -3.7344, -0.3008, 2.9844, -1.8516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9375, -1.1719, 2.3906, -1.6094, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:30:39,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.35 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.9062, -5.5625, -1.5781, 0.1865, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.9375, -4.9375, 0.8359, 1.9453, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-8.8125e+00, -6.0938e+00, 2.1057e-03, 1.5991e-02, -6.4062e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5938, -4.3125, -1.1484, 2.2500, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.0312, -4.5312, 1.2812, 1.4922, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:30:42,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.21 | optimizer_step: 0.19 [2025-11-06 18:30:42,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.30 | bwd_microstep: 2659.67 | bwd_inner_microstep: 1.38 | bwd_allreduce_microstep: 2658.19 | step_microstep: 2.16 [2025-11-06 18:30:42,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 449.67 | bwd: 2660.67 | bwd_inner: 2.30 | bwd_allreduce: 2658.24 | step: 2.24 54%|█████▎ | 1878/3507 [45:55<50:26, 1.86s/it] {'loss': 0.2392, 'learning_rate': 9.340015866106755e-06, 'epoch': 0.54} 54%|█████▎ | 1878/3507 [45:55<50:26, 1.86s/it]tensor([[-2.4844, -3.5000, -2.0312, 3.0469, -0.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 
18:30:42,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.54 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.1875, -2.2344, 0.9531, 0.4648, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3750, -3.0625, 0.4648, 2.0469, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7188, -3.7344, 0.3281, 3.0625, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5625, -4.2500, -0.8477, 0.5742, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8438, -0.7812, 1.6250, -1.0156, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5000, -4.3438, -0.6758, 3.5938, -1.8359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1250, -0.0209, 3.5781, -1.3594, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:30:42,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:30:42,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.31 | bwd_microstep: 138.96 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 137.87 | step_microstep: 1.91 [2025-11-06 18:30:42,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 268.87 | bwd: 139.92 | bwd_inner: 1.88 | bwd_allreduce: 137.91 | step: 1.99 54%|█████▎ | 1879/3507 [45:56<38:51, 1.43s/it] {'loss': 0.1613, 'learning_rate': 9.330799027182015e-06, 'epoch': 0.54} 54%|█████▎ | 1879/3507 [45:56<38:51, 1.43s/it]tensor([[-5.5938, -4.8438, -0.6094, 2.8750, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:30:42,661] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.90 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.9531, -3.5938, -2.3906, 1.6562, -0.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4062, -0.9961, 2.9062, -0.0076, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5625, -3.5156, -0.6562, 3.0000, -1.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.4844, -1.7266, 1.8594, 4.6562, -0.6602]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5938, -5.5000, -2.8125, 0.9844, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2188, -4.6875, -0.5977, 2.9062, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -4.6562, -0.9805, 1.9141, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:30:44,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:30:44,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.38 | bwd_microstep: 1247.70 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1246.63 | step_microstep: 1.76 [2025-11-06 18:30:44,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.30 | bwd: 1248.74 | bwd_inner: 1.92 | bwd_allreduce: 1246.67 | step: 1.84 54%|█████▎ | 1880/3507 [45:57<40:23, 1.49s/it] {'loss': 0.0594, 'learning_rate': 9.32158275926635e-06, 'epoch': 0.54} 54%|█████▎ | 1880/3507 [45:57<40:23, 1.49s/it]tensor([[-5.4375, -1.8984, 2.2656, -1.1094, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.3438, -4.0938, 0.9336, -1.0312, -5.9062]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, -2.3750, 1.0000, 0.4473, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1875, -2.0625, 2.4375, 0.3418, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:30:44,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.13 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9062, -3.6719, 0.7031, 2.8594, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0312, -2.9844, 0.4141, 4.8125, -0.6523]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.0078, 2.1719, 2.7188, -1.7031, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.5781, 0.4668, 3.2969, -1.3828, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:30:44,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.14 | optimizer_step: 0.23 [2025-11-06 18:30:44,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 218.09 | bwd_microstep: 1.97 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.96 | step_microstep: 1.95 [2025-11-06 18:30:44,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.25 | bwd: 2.88 | bwd_inner: 1.74 | bwd_allreduce: 1.00 | step: 2.02 54%|█████▎ | 1881/3507 [45:58<31:45, 1.17s/it] {'loss': 0.2539, 'learning_rate': 9.31236707022373e-06, 'epoch': 0.54} 54%|█████▎ | 1881/3507 [45:58<31:45, 1.17s/it]tensor([[-6.7500, -4.9062, 0.4805, 1.8672, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4688, 0.9883, 2.8281, -1.3672, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:3') tensor([[-6.5000, -4.2188, 1.2500, 1.5000, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:30:44,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.31 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.0000, 0.8477, 2.7656, -2.1719, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7812, 0.9180, 3.2031, -1.5000, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0312, -4.0312, -0.8594, 3.0469, -1.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5469, -2.0781, 1.0234, 1.7578, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0938, -1.5703, 1.8125, 0.5977, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:30:47,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.31 | optimizer_step: 0.29 [2025-11-06 18:30:47,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.74 | bwd_microstep: 2763.79 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 2762.91 | step_microstep: 3.73 [2025-11-06 18:30:47,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 459.09 | bwd: 2764.84 | bwd_inner: 1.70 | bwd_allreduce: 2762.98 | step: 3.81 54%|█████▎ | 1882/3507 [46:01<48:48, 1.80s/it] {'loss': 0.2781, 'learning_rate': 9.303151967917626e-06, 'epoch': 0.54} 54%|█████▎ | 1882/3507 [46:01<48:48, 1.80s/it]tensor([[-1.4609, -2.2188, -0.7578, 3.6250, 0.5273]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.6562, -6.3750, -3.0469, 0.7422, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-7.3125, -5.6562, -1.4844, -0.3691, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -1.9531, 1.8750, 1.6875, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:30:48,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.07 | bwd_microstep: 1.13 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-4.5000, -2.4688, 1.2344, 1.3984, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.4062, -3.4062, 0.3691, 2.5312, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0938, -3.7188, 0.1553, 1.4922, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5781, -1.3750, 1.8594, 0.9180, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:30:48,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.30 | optimizer_step: 0.20 [2025-11-06 18:30:48,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 148.11 | bwd_microstep: 100.73 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 99.73 | step_microstep: 7.66 [2025-11-06 18:30:48,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.22 | bwd: 101.85 | bwd_inner: 1.89 | bwd_allreduce: 99.78 | step: 7.76 54%|█████▎ | 1883/3507 [46:02<38:10, 1.41s/it] {'loss': 0.3412, 'learning_rate': 9.293937460211005e-06, 'epoch': 0.54} 54%|█████▎ | 1883/3507 [46:02<38:10, 1.41s/it]tensor([[-5.5312, -4.0000, 0.3711, 1.8047, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:30:48,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.47 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.64 |
bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.1875, -1.4688, 1.3359, -0.6445, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6562, -2.0469, 2.6562, -0.5742, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4062, -3.3906, 0.0173, 1.7578, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.5625, -3.6406, 1.7891, 0.5977, -4.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.4004, 3.2656, 4.6875, -0.6680, -1.4766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.9062, -6.0938, -1.3672, 2.2344, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-8.1875, -6.7500, -1.0625, 1.6016, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:30:51,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 18:30:51,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.44 | bwd_microstep: 2879.11 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2878.00 | step_microstep: 2.40 [2025-11-06 18:30:51,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 276.91 | bwd: 2879.86 | bwd_inner: 1.64 | bwd_allreduce: 2878.05 | step: 2.49 54%|█████▎ | 1884/3507 [46:05<52:51, 1.95s/it] {'loss': 0.4808, 'learning_rate': 9.284723554966335e-06, 'epoch': 0.54} 54%|█████▎ | 1884/3507 [46:05<52:51, 1.95s/it]tensor([[-5.2500, -3.5625, 0.4512, 1.2891, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5000, -6.0312, -1.3594, 2.7656, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7344, -4.3750, -2.3281, 2.2812, -1.2500]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7500, -3.6875, -0.2012, 1.8750, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:30:51,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.28 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.0938, -1.1562, 2.8438, 0.5820, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1250, -4.1562, -0.8555, 3.3750, -1.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-8.3125, -6.3125, -1.6641, -0.9844, -6.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8125, -0.8711, 3.3750, -1.0938, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:30:52,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:30:52,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.59 | bwd_microstep: 87.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 87.02 | step_microstep: 2.10 [2025-11-06 18:30:52,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 431.90 | bwd: 88.91 | bwd_inner: 1.67 | bwd_allreduce: 87.08 | step: 2.21 54%|█████▎ | 1885/3507 [46:05<41:36, 1.54s/it] {'loss': 0.4254, 'learning_rate': 9.27551026004556e-06, 'epoch': 0.54} 54%|█████▎ | 1885/3507 [46:05<41:36, 1.54s/it]tensor([[-3.2656, -2.2812, 0.6758, 2.1875, -1.7422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.3438, -5.1250, -0.3711, 2.1562, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:30:52,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
146.03 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.4062, -0.2578, 2.3438, -3.0781, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -3.7188, -0.4746, 1.9766, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6875, -4.1562, -0.3301, 0.5938, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.2188, -2.9844, 1.3516, 1.3984, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7812, -2.9219, 2.0938, 0.7734, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.5469, 0.6875, 2.0000, -1.9375, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:30:53,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 18:30:53,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.72 | bwd_microstep: 836.61 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 835.48 | step_microstep: 2.25 [2025-11-06 18:30:53,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.78 | bwd: 837.30 | bwd_inner: 1.62 | bwd_allreduce: 835.52 | step: 2.33 54%|█████▍ | 1886/3507 [46:07<38:44, 1.43s/it] {'loss': 0.4481, 'learning_rate': 9.266297583310106e-06, 'epoch': 0.54} 54%|█████▍ | 1886/3507 [46:07<38:44, 1.43s/it]tensor([[-7.2188, -3.3750, 2.0625, -1.1328, -6.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:30:53,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.05 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.7188, -4.2188, -0.7383, 2.7656, -2.3281]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.6562, -2.6562, 1.2969, 1.6016, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.7812, -3.2188, -1.3594, 2.8906, -0.6133]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.4688, -2.7656, 2.0781, 1.0625, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5312, -4.5312, -1.3750, 2.6562, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4062, -4.8438, -0.8242, 2.3906, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3750, -4.8750, -1.1562, 2.1250, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:30:54,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.21 | optimizer_step: 0.30
[2025-11-06 18:30:54,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.54 | bwd_microstep: 1182.86 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 1181.99 | step_microstep: 2.28
[2025-11-06 18:30:54,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.63 | bwd: 1183.57 | bwd_inner: 1.38 | bwd_allreduce: 1182.04 | step: 2.36
54%|█████▍ | 1887/3507 [46:08<39:54, 1.48s/it] {'loss': 0.1414, 'learning_rate': 9.257085532620875e-06, 'epoch': 0.54}
tensor([[-4.3750, 0.1279, 4.2500, -1.5078, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5938, -3.0000, 1.6328, 0.8945, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6875, -1.3359, 2.6719, -0.1040, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:30:55,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.93 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.2188, -2.8281, 1.0938, 2.4688, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2500, -4.2188, -0.0106, 2.3281, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.7812, -3.9219, 0.8125, 1.7891, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.7109, -2.2500, -1.0156, 2.7188, 0.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-5.6562, -2.8281, 0.8945, -1.1641, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:30:57,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.74 | optimizer_gradients: 0.19 | optimizer_step: 0.22
[2025-11-06 18:30:57,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.79 | bwd_microstep: 1834.95 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 1834.06 | step_microstep: 2.78
[2025-11-06 18:30:57,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.74 | bwd: 1835.77 | bwd_inner: 1.52 | bwd_allreduce: 1834.10 | step: 2.86
54%|█████▍ | 1888/3507 [46:10<46:03, 1.71s/it] {'loss': 1.0835, 'learning_rate': 9.247874115838236e-06, 'epoch': 0.54}
tensor([[-3.5469, -2.3594, 1.0156, 2.2500, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:30:57,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.92 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.7188, -4.9688, 0.3594, 1.8906, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.0625, -4.4062, -0.4805, -1.5234, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6875, -3.7656, -0.1196, 2.3594, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5938, -3.0000, 0.5508, 1.4844, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4688, -4.1250, -0.6797, 2.9844, -2.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.6602, 1.2969, 3.4062, 2.7812, -0.2676]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.7812, -2.2344, 1.4062, 0.3301, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:30:58,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.21
[2025-11-06 18:30:58,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.66 | bwd_microstep: 700.32 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 699.34 | step_microstep: 2.10
[2025-11-06 18:30:58,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 473.63 | bwd: 701.22 | bwd_inner: 1.66 | bwd_allreduce: 699.39 | step: 2.19
54%|█████▍ | 1889/3507 [46:12<42:06, 1.56s/it] {'loss': 0.4188, 'learning_rate': 9.23866334082201e-06, 'epoch': 0.54}
tensor([[-4.7500, -3.6562, 0.5000, 3.0312, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6875, -4.0312, -0.1357, 3.2656, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3750, -2.6719, 2.0938, 1.1328, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:30:58,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 282.30 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.8438, -4.2812, -0.2754, 3.3125, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8750, -4.3750, -0.3281, 3.2188, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.6562, -1.2344, 1.6094, 0.1299, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9844, -4.0312, -1.4375, 2.1562, -1.6797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.8438, -3.6094, 1.7578, -0.2676, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:31:00,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:31:00,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.94 | bwd_microstep: 1659.42 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 1658.59 | step_microstep: 1.66
[2025-11-06 18:31:00,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 452.27 | bwd: 1660.39 | bwd_inner: 1.60 | bwd_allreduce: 1658.64 | step: 1.75
54%|█████▍ | 1890/3507 [46:14<46:55, 1.74s/it] {'loss': 0.5582, 'learning_rate': 9.22945321543148e-06, 'epoch': 0.54}
tensor([[-2.3281, -2.8906, -2.0781, 1.2344, -0.4805]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0312, -3.7344, 1.0234, 1.0625, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:31:00,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.36 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.2188, -4.0625, 0.9219, 1.6641, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.2812, -2.4531, 2.0938, 0.6914, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5000, -4.4375, -1.4297, 2.1719, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.1875, -3.4531, 0.4941, 1.2969, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6875, -2.6562, 2.3125, 0.6836, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9062, -1.3438, 2.4844, -1.0938, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:31:01,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.84 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:31:01,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.44 | bwd_microstep: 887.17 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 886.16 | step_microstep: 2.78
[2025-11-06 18:31:01,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 286.82 | bwd: 888.04 | bwd_inner: 1.71 | bwd_allreduce: 886.20 | step: 2.86
54%|█████▍ | 1891/3507 [46:15<42:34, 1.58s/it] {'loss': 0.367, 'learning_rate': 9.220243747525363e-06, 'epoch': 0.54}
tensor([[-3.2031, 0.0859, 2.5312, -1.0391, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.4688, -3.3906, -1.9844, 2.7500, -0.1914]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-5.3438, -2.4688, 0.8828, -0.8711, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:31:01,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.33 | bwd_microstep: 1.11 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.3438, -1.6641, 3.0625, 2.2656, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2812, -3.5781, 0.8906, 2.0000, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.2812, -3.8125, 0.3066, 1.9219, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.1875, -4.4375, -1.8359, 2.2500, -1.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7188, -2.0312, 1.6406, 0.3926, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:31:02,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.20 | optimizer_step: 0.19
[2025-11-06 18:31:02,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.52 | bwd_microstep: 446.30 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 445.14 | step_microstep: 2.14
[2025-11-06 18:31:02,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.87 | bwd: 447.41 | bwd_inner: 2.08 | bwd_allreduce: 445.18 | step: 2.22
54%|█████▍ | 1892/3507 [46:16<36:23, 1.35s/it] {'loss': 0.801, 'learning_rate': 9.211034944961825e-06, 'epoch': 0.54}
tensor([[-4.0312, -1.5781, 2.2188, 1.1797, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:31:02,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.26 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.4062, -3.5312, 0.5664, 1.0391, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.9805, 2.2969, 2.5938, -1.8281, -1.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-3.1094, -1.0938, 2.8125, 3.1719, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1562, -5.0938, -1.5312, 2.7812, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.0938, 0.9023, 3.5625, 1.0391, -1.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.7188, -4.0938, 0.1650, 1.5781, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[ 0.5664, 3.7812, 5.5938, 1.5469, -0.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:31:04,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.30 | optimizer_step: 0.39
[2025-11-06 18:31:04,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.89 | bwd_microstep: 1493.48 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1492.38 | step_microstep: 2.73
[2025-11-06 18:31:04,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.16 | bwd: 1494.52 | bwd_inner: 1.94 | bwd_allreduce: 1492.43 | step: 2.81
54%|█████▍ | 1893/3507 [46:18<43:14, 1.61s/it] {'loss': 0.4105, 'learning_rate': 9.201826815598455e-06, 'epoch': 0.54}
tensor([[-6.8750, -4.9688, -0.4355, 0.4355, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[5.7500, 7.5312, 8.3125, 6.7812, 4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:31:04,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.34 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.4375, -1.5312, 2.7344, -1.4766, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.4531, -0.4922, 2.4844, -0.4805, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6250, -0.0430, 2.4062, -1.7344, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-4.5312, -3.4375, 0.6836, 2.9844, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7188, -4.2812, -0.5430, 3.0938, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.8984, 1.8516, 3.0469, -2.2656, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:31:06,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:31:06,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.67 | bwd_microstep: 917.42 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 916.31 | step_microstep: 1.94
[2025-11-06 18:31:06,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.04 | bwd: 918.30 | bwd_inner: 1.83 | bwd_allreduce: 916.34 | step: 2.01
54%|█████▍ | 1894/3507 [46:19<40:34, 1.51s/it] {'loss': 0.9521, 'learning_rate': 9.192619367292281e-06, 'epoch': 0.54}
tensor([[-5.0000, -2.9062, 1.2109, 1.1953, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:31:06,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.68 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.8438, -3.0469, 0.5547, 0.9570, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-0.5391, 2.5156, 5.3750, 2.2031, -0.8477]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5938, -3.0312, 0.2461, 0.6680, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8750, -1.3203, 2.4844, -0.9141, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0938, -5.0312, -1.5469, 2.8594, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6875, -4.1875, 0.3086, 2.0781, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4688, -3.5469, -0.6836, 3.2031, -1.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:31:08,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:31:08,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.66 | bwd_microstep: 1995.42 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1994.19 | step_microstep: 2.05
[2025-11-06 18:31:08,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.36 | bwd: 1996.24 | bwd_inner: 1.87 | bwd_allreduce: 1994.24 | step: 2.14
54%|█████▍ | 1895/3507 [46:22<47:36, 1.77s/it] {'loss': 0.3196, 'learning_rate': 9.183412607899741e-06, 'epoch': 0.54}
tensor([[-7.9688, -5.7500, -0.3633, 0.5078, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:31:08,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 67.38 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.6250, -3.2500, -0.1709, 2.7812, -1.6172]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9062, -3.6719, 0.4551, 2.3750, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.9688, -4.6562, -0.0579, 2.0469, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5000, -5.5938, -1.9453, 2.8594, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.5000, -3.9844, -0.2100, 0.8516, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9688, -2.8906, 1.2109, 1.0781, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3438, -0.2305, 2.9688, -2.1250, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:31:08,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:31:08,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.58 | bwd_microstep: 220.20 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 219.14 | step_microstep: 1.61
[2025-11-06 18:31:08,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 196.98 | bwd: 221.06 | bwd_inner: 1.75 | bwd_allreduce: 219.18 | step: 1.69
54%|█████▍ | 1896/3507 [46:22<36:52, 1.37s/it] {'loss': 0.3382, 'learning_rate': 9.174206545276678e-06, 'epoch': 0.54}
tensor([[-4.7188, -0.4121, 3.3906, -1.7188, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:31:09,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.24 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-1.2656, 2.1875, 3.0312, -1.8125, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0625e+00, -3.9062e+00, -1.4496e-04, 1.9375e+00, -3.0156e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.3125, -3.2188, 2.2969, 0.7656, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9375, -4.6562, -1.2734, 2.3125, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.0000e+00, -4.7188e+00, 3.8147e-03, 2.4531e+00, -3.6094e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.8438, -1.8594, 0.9180, -1.9062, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.0000, -3.2969, 0.7617, 3.7812, -1.8516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:31:10,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:31:10,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.36 | bwd_microstep: 1346.38 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1345.22 | step_microstep: 1.62
[2025-11-06 18:31:10,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.62 | bwd: 1347.37 | bwd_inner: 1.99 | bwd_allreduce: 1345.26 | step: 1.69
54%|█████▍ | 1897/3507 [46:24<39:42, 1.48s/it] {'loss': 0.3227, 'learning_rate': 9.165001187278357e-06, 'epoch': 0.54}
tensor([[-7.1250, -5.3438, -0.1260, 1.3047, -4.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7969, -4.3438, -1.8594, 2.8750, -1.2422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5312, -4.6562, -0.7852, 1.7734, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:31:10,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.60 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.2031, -3.0469, -0.9453, 1.6953, -1.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1250, -4.7500, -2.9844, 1.2344, -1.6484]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.8125, -0.2266, 1.4219, 1.1797, -1.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.2812, -4.2188, -0.1729, 2.1719, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9375, -3.2344, 0.5586, 1.4766, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:31:11,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:31:11,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.26 | bwd_microstep: 82.00 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 80.92 | step_microstep: 2.13
[2025-11-06 18:31:11,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.88 | bwd: 82.93 | bwd_inner: 1.83 | bwd_allreduce: 80.96 | step: 2.22
54%|█████▍ | 1898/3507 [46:24<31:25, 1.17s/it] {'loss': 0.2444, 'learning_rate': 9.155796541759429e-06, 'epoch': 0.54}
tensor([[-7.9688, -7.1250, -2.2812, 1.1953, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:31:11,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.23 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.3750, 0.9531, 3.5000, -2.3750, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.8789, 2.4375, 3.6406, -0.5938, -1.5859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.4375, -3.8750, -0.2637, 2.7969, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6875, -2.5625, 0.7148, -1.9297, -4.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.7812, -2.5625, 2.2500, -0.0229, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.5312, -3.6719, 0.5625, 0.9180, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2812, -4.3125, -1.6797, 2.0312, -1.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:31:13,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.20 | optimizer_step: 0.30
[2025-11-06 18:31:13,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.24 | bwd_microstep: 2357.30 | bwd_inner_microstep: 1.38 | bwd_allreduce_microstep: 2355.81 | step_microstep: 2.23
[2025-11-06 18:31:13,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.50 | bwd: 2358.02 | bwd_inner: 2.00 | bwd_allreduce: 2355.87 | step: 2.31
54%|█████▍ | 1899/3507 [46:27<43:54, 1.64s/it] {'loss': 0.2013, 'learning_rate': 9.146592616573942e-06, 'epoch': 0.54}
tensor([[-6.7500, -5.0312, 0.2188, 1.8984, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2812, -4.0938, -1.0547, 2.4062, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:31:13,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.74 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.4688, -1.0312, 2.6562, -0.3672, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.7188, -4.0938, -0.6602, 2.1719, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6875, -1.3906, 2.2031, -0.4590, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.1562, -1.9844, 0.3418, -2.5938, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.3750, -4.4062, 1.0469, 2.0469, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.0312, -2.6250, 1.9922, -0.6016, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
[2025-11-06 18:31:14,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:31:14,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.91 | bwd_microstep: 208.66 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 207.66 | step_microstep: 1.54
[2025-11-06 18:31:14,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.67 | bwd: 209.61 | bwd_inner: 1.80 | bwd_allreduce: 207.70 | step: 1.61
54%|█████▍ | 1900/3507 [46:28<35:01, 1.31s/it] {'loss': 0.8258, 'learning_rate': 9.13738941957533e-06, 'epoch': 0.54}
tensor([[-5.2188, -4.3438, -0.3848, 2.4062, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.2812, -4.2812, -0.1729, 2.5312, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6562, -2.3438, 2.6250, 0.2139, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:31:14,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.43 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.0938, -4.2812, 0.3613, 1.5625, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.2812, -5.4062, -1.0000, -0.0115, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.3438, -4.3125, -1.5078, 2.4219, -1.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.4219, -0.4102, 2.5156, -0.0859, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[1.3984, 2.9062, 5.3750, 5.6875, 1.7109]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:31:16,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.21 | optimizer_step: 0.23
[2025-11-06 18:31:16,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.50 | bwd_microstep: 1600.41 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1599.28 | step_microstep: 3.15
[2025-11-06 18:31:16,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.95 | bwd: 1601.42 | bwd_inner: 1.95 | bwd_allreduce: 1599.33 | step: 3.24
54%|█████▍ | 1901/3507 [46:30<40:35, 1.52s/it] {'loss': 0.1978, 'learning_rate': 9.12818695861641e-06, 'epoch': 0.54}
tensor([[-1.7266, -2.4531, -1.8750, 1.5625, 0.0444]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:31:16,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.12 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.7500, -4.8750, -1.4844, 2.8906, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0938, -1.0391, 2.7344, 0.2471, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.3906, 0.1611, 3.2500, -0.4668, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.9062, -3.4219, 1.8203, -0.5820, -5.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.9688, -3.9375, -2.7969, 1.8438, -0.6133]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3750, -0.7539, 3.0625, -0.6680, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.1562, -3.0781, 2.1094, 0.4473, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:31:16,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:31:16,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.68 | bwd_microstep: 194.62 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 193.39 | step_microstep: 1.88
[2025-11-06 18:31:16,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.82 | bwd: 195.66 | bwd_inner: 2.11 | bwd_allreduce: 193.42 | step: 1.96
54%|█████▍ | 1902/3507 [46:30<32:50, 1.23s/it] {'loss': 0.1101, 'learning_rate': 9.118985241549352e-06, 'epoch': 0.54}
tensor([[-2.8750, 0.6680, 3.0312, -1.2422, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9688, -3.9219, -0.8047, 3.1875, -1.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:31:17,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.83 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.2812, -3.4062, -1.3828, 2.0938, -1.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.9375, -2.5938, 1.3672, 0.7344, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9688, -3.9375, -0.1367, 1.9531, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.8125, -3.5000, -1.8047, 2.5312, -0.5508]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2188, -2.9531, 0.5078, -0.2500, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.1562, -5.3438, 0.3086, 1.8594, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:31:18,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:31:18,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.11 | bwd_microstep: 1769.69 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 1768.38 | step_microstep: 1.90
[2025-11-06 18:31:18,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.96 | bwd: 1770.60 | bwd_inner: 2.05 | bwd_allreduce: 1768.42 | step: 1.98
54%|█████▍ | 1903/3507 [46:32<39:54, 1.49s/it] {'loss': 0.9038, 'learning_rate': 9.109784276225713e-06, 'epoch': 0.54}
tensor([[-5.5625, -2.6875, 2.1562, 0.6016, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1875, -2.2344, 1.4297, -0.9805, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:31:19,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.08 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.8906, -0.1934, 2.2188, -2.2344, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.5625, -5.0312, 0.3066, 2.2344, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1562, -2.8906, 1.1172, 2.7969, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.0156, -2.7656, -1.9219, 2.0156, -0.0515]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-4.2500, -2.5312, 0.8711, 1.2109, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.3125, -2.4688, 2.3281, 1.0703, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:31:19,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:31:19,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.86 | bwd_microstep: 45.20 | bwd_inner_microstep: 1.34 | bwd_allreduce_microstep: 43.76 | step_microstep: 1.76
[2025-11-06 18:31:19,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.97 | bwd: 45.99 | bwd_inner: 2.04 | bwd_allreduce: 43.80 | step: 1.84
54%|█████▍ | 1904/3507 [46:33<31:19, 1.17s/it] {'loss': 0.5136, 'learning_rate': 9.100584070496401e-06, 'epoch': 0.54}
tensor([[-3.5156, 0.2930, 2.8438, -1.7969, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3750, -0.2227, 3.6719, -1.1641, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:31:19,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.84 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-5.5625, -1.7188, 2.7500, -1.3438, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.2500, -1.4531, 2.6250, -1.4219, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.8906, -3.3906, -2.2969, 1.2109, -0.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5312, -3.8281, 0.1260, 3.3594, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.9766, 1.5234, 2.8906, -1.6406, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5000, -1.2266, 2.7031, -0.0840, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:31:21,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.23 | optimizer_step: 0.20
[2025-11-06 18:31:21,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.02 | bwd_microstep: 1189.79 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1188.61 | step_microstep: 2.20
[2025-11-06 18:31:21,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 432.89 | bwd: 1190.75 | bwd_inner: 1.88 | bwd_allreduce: 1188.68 | step: 2.30
54%|█████▍ | 1905/3507 [46:34<35:16, 1.32s/it] {'loss': 0.0898, 'learning_rate': 9.09138463221167e-06, 'epoch': 0.54}
tensor([[-5.0312, -4.6875, -1.1953, 2.3438, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0625, -3.2656, 0.7188, -0.8906, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:31:21,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.20 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.9531, -1.8047, 1.5938, 1.0156, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.9844, 0.6094, 2.7656, -1.7188, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.1875, -3.9844, 0.2285, 2.3281, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2188, -3.8125, -0.7070, 2.4688, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0000, -3.4219, -0.1108, 2.7031, -1.9609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.8125, -3.3125, 1.1250, 2.6719, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:31:21,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 18:31:21,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.65 | bwd_microstep: 360.46 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 359.35 | step_microstep: 1.84
[2025-11-06 18:31:21,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.85 | bwd: 361.32 | bwd_inner: 1.81 | bwd_allreduce: 359.38 | step: 1.91
54%|█████▍ | 1906/3507 [46:35<30:13, 1.13s/it] {'loss': 0.1606, 'learning_rate': 9.082185969221133e-06, 'epoch': 0.54}
tensor([[-4.0625, -4.7812, -3.0000, 1.6016, -1.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:31:21,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.98 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.1875, -4.7188, 0.2656, 2.2656, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4375, -2.9219, 0.8984, 2.3594, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.3125, -0.3770, 2.6094, 0.0723, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.7188, -4.0312, 0.2910, 1.3906, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3125, -4.7188, -1.8750, 2.8594, -1.6484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2188, -4.1875, 0.2295, 2.6562, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.3125, -4.4062, 0.0134, 0.5234, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:31:24,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.19 | optimizer_step: 0.32
[2025-11-06 18:31:24,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.82 | bwd_microstep: 2791.01 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 2789.66 | step_microstep: 2.38
[2025-11-06 18:31:24,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.83 | bwd: 2791.85 | bwd_inner: 1.97 | bwd_allreduce: 2789.70 | step: 2.46
54%|█████▍ | 1907/3507 [46:38<46:40, 1.75s/it] {'loss': 1.0754, 'learning_rate': 9.072988089373726e-06, 'epoch': 0.54}
tensor([[-3.7812, -4.4688, -2.5938, 1.8750, -1.3047]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5469, -3.7344, -0.8203, 3.4844, -1.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:31:25,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.91 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.5938, -4.4375, -1.4453, 2.2656, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0312, -3.5469, 0.1406, 1.2109, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.3125, 1.8594, 3.6875, -2.1562, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7812, -4.1250, -1.5469, 2.7656, -1.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8438, -1.3594, 1.7656, 0.6992, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.8906, 2.0781, 4.6875, -3.0781, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:31:25,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 18:31:25,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.64 | bwd_microstep: 176.50 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 175.49 | step_microstep: 1.74 [2025-11-06 18:31:25,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.57 | bwd: 177.45 | bwd_inner: 1.78 | bwd_allreduce: 175.54 | step: 1.82 54%|█████▍ | 1908/3507 [46:39<36:57, 1.39s/it] {'loss': 0.2618, 'learning_rate': 9.063791000517722e-06, 'epoch': 0.54} 54%|█████▍ | 1908/3507 [46:39<36:57, 1.39s/it]tensor([[-5.0312, -3.9375, -0.2656, 1.4688, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.7188, -4.1875, 1.3047, 1.1562, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1875, -3.9219, -2.4219, 2.0781, -0.8242]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:31:25,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.81 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.0938, -3.2188, 0.2129, 2.6406, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2500, -4.3125, 1.1719, 2.3750, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5000, -1.7969, 1.7812, 0.0381, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.7031, 0.0342, 0.9336, 
-1.5625, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.7500, -3.4375, 0.4336, -0.2637, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:31:28,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:31:28,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.58 | bwd_microstep: 2436.92 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 2435.92 | step_microstep: 2.59 [2025-11-06 18:31:28,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 436.43 | bwd: 2437.88 | bwd_inner: 1.74 | bwd_allreduce: 2435.97 | step: 2.69 54%|█████▍ | 1909/3507 [46:42<49:09, 1.85s/it] {'loss': 0.7029, 'learning_rate': 9.054594710500723e-06, 'epoch': 0.54} 54%|█████▍ | 1909/3507 [46:42<49:09, 1.85s/it]tensor([[-2.4688, 0.5820, 2.5469, -0.6953, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:31:28,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.24 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.0625, -0.2500, 2.6875, -2.1250, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3438, -0.7344, 3.3594, -0.5547, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7500, -2.0938, 2.7188, -0.8867, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9375, -2.3594, 2.3750, -0.6133, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.6172, 0.2227, 2.7344, 2.3750, -0.9766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4375, -2.3281, 2.6094, 0.5938, -4.3750]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6875, -1.9766, 1.5156, -0.1973, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:31:28,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 18:31:28,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.76 | bwd_microstep: 15.78 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 14.59 | step_microstep: 2.04 [2025-11-06 18:31:28,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.03 | bwd: 16.66 | bwd_inner: 1.92 | bwd_allreduce: 14.62 | step: 2.11 54%|█████▍ | 1910/3507 [46:42<37:48, 1.42s/it] {'loss': 0.4205, 'learning_rate': 9.04539922716965e-06, 'epoch': 0.54} 54%|█████▍ | 1910/3507 [46:42<37:48, 1.42s/it]tensor([[-3.7031, 0.2656, 3.0938, -1.9531, -3.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4062, -4.0000, -0.2451, 3.3906, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:31:29,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.13 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.6562, -1.6328, 3.2344, -1.1016, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6719, -1.1875, 3.0312, 2.0938, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7031, -3.2812, -0.4238, 2.3438, -1.7578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7812, -4.3438, -0.6406, 2.9844, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8125, -3.6250, 0.3965, 2.5938, -2.7188]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5312, -4.8750, -0.5781, 0.5977, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:31:31,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 18:31:31,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.05 | bwd_microstep: 2441.98 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 2440.71 | step_microstep: 1.96 [2025-11-06 18:31:31,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.20 | bwd: 2442.82 | bwd_inner: 1.91 | bwd_allreduce: 2440.75 | step: 2.04 54%|█████▍ | 1911/3507 [46:45<49:07, 1.85s/it] {'loss': 0.5737, 'learning_rate': 9.036204558370725e-06, 'epoch': 0.54} 54%|█████▍ | 1911/3507 [46:45<49:07, 1.85s/it]tensor([[-2.0000, 0.3691, 2.5156, 1.0391, -1.6172]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0938, -4.3125, 0.1270, 1.0234, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:31:31,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.96 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.2188, -4.7188, -0.4180, 0.8945, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5312, -1.1797, 2.9375, -0.1436, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -4.1562, -0.8164, 3.0469, -1.8203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7812, -2.6094, 2.4062, 0.5000, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.2500, -4.6875, 0.4785, 0.1338, -5.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:2') tensor([[-3.3281, -3.3906, -1.0469, 2.5469, -1.2109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:31:32,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:31:32,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.66 | bwd_microstep: 39.00 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 37.87 | step_microstep: 1.55 [2025-11-06 18:31:32,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.61 | bwd: 39.92 | bwd_inner: 1.89 | bwd_allreduce: 37.90 | step: 1.64 55%|█████▍ | 1912/3507 [46:45<37:47, 1.42s/it] {'loss': 0.374, 'learning_rate': 9.027010711949494e-06, 'epoch': 0.55} 55%|█████▍ | 1912/3507 [46:45<37:47, 1.42s/it]tensor([[-3.5156, -2.1562, 0.6250, 1.3906, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -3.8125, 0.0615, 2.4375, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7812, -1.1562, 3.0625, -0.4043, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:31:32,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 269.10 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9688, -4.5625, -0.9414, 2.5469, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4219, 1.7656, 4.3750, -1.1484, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.7812, -1.6953, 1.4844, -1.5703, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.6875, -2.4375, 0.6953, -2.0781, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6250, 
-1.1641, 2.5938, -0.5820, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:31:34,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.16 | optimizer_step: 0.23 [2025-11-06 18:31:34,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.26 | bwd_microstep: 1634.43 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 1633.18 | step_microstep: 2.08 [2025-11-06 18:31:34,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.38 | bwd: 1635.36 | bwd_inner: 1.99 | bwd_allreduce: 1633.22 | step: 2.16 55%|█████▍ | 1913/3507 [46:47<43:00, 1.62s/it] {'loss': 0.836, 'learning_rate': 9.01781769575078e-06, 'epoch': 0.55} 55%|█████▍ | 1913/3507 [46:47<43:00, 1.62s/it]tensor([[-6.9688, -4.7188, 0.6836, 1.0469, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0000, -3.9844, 0.2891, 0.7227, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:31:34,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.21 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.2188, -3.5938, 0.0664, 0.9180, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3750, -2.6094, 1.5391, 0.0635, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.6562, -5.3750, -0.7500, -0.6602, -5.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0000, -0.4414, 2.9219, 3.5000, -0.9414]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2188, 0.5586, 3.7969, 1.4531, -1.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5000, -1.9844, 2.5000, 1.5938, -3.3125]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:31:34,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:31:34,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.28 | bwd_microstep: 1.72 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.61 | step_microstep: 1.54 [2025-11-06 18:31:34,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 416.51 | bwd: 2.61 | bwd_inner: 1.85 | bwd_allreduce: 0.64 | step: 1.61 55%|█████▍ | 1914/3507 [46:48<33:42, 1.27s/it] {'loss': 0.6891, 'learning_rate': 9.008625517618709e-06, 'epoch': 0.55} 55%|█████▍ | 1914/3507 [46:48<33:42, 1.27s/it]tensor([[-6.5000, -4.7188, 0.7617, 2.1562, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:31:34,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.65 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.2500, -3.6719, 0.6328, 1.7188, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-9.6875, -8.1250, -2.2031, 0.4570, -6.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3125, -1.2500, 2.2188, -0.6797, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0625, -4.5000, 0.2988, 2.2344, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9688, -1.2656, 2.1719, 0.4453, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2812, -0.1387, 2.5156, -0.9531, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4375, -5.0625, -2.8750, 1.6484, -1.8125]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:31:36,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 18:31:36,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.60 | bwd_microstep: 1604.98 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1603.77 | step_microstep: 1.97 [2025-11-06 18:31:36,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.29 | bwd: 1605.85 | bwd_inner: 1.92 | bwd_allreduce: 1603.81 | step: 2.04 55%|█████▍ | 1915/3507 [46:50<39:24, 1.48s/it] {'loss': 0.6136, 'learning_rate': 8.999434185396693e-06, 'epoch': 0.55} 55%|█████▍ | 1915/3507 [46:50<39:24, 1.48s/it]tensor([[-0.9766, 2.0938, 2.6250, -1.7031, -1.7578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.5625, -6.3438, -1.3906, 1.2969, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:31:36,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.55 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.6250, -3.6094, 0.8672, 1.2969, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.9062, -2.4062, 2.8125, 0.1191, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4062, -4.9375, -1.3047, 2.2031, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0000, -4.1250, -1.3750, 2.6719, -1.6016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -0.7422, 2.4062, -2.9688, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4688e+00, -1.2656e+00, 2.9531e+00, -3.9062e-03, -3.9219e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:1') [2025-11-06 18:31:37,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:31:37,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.89 | bwd_microstep: 119.23 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 118.34 | step_microstep: 1.99 [2025-11-06 18:31:37,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.47 | bwd: 120.02 | bwd_inner: 1.51 | bwd_allreduce: 118.38 | step: 2.06 55%|█████▍ | 1916/3507 [46:50<31:36, 1.19s/it] {'loss': 0.1659, 'learning_rate': 8.990243706927418e-06, 'epoch': 0.55} 55%|█████▍ | 1916/3507 [46:50<31:36, 1.19s/it]tensor([[-5.1562, -4.4375, -0.6250, 2.2031, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:31:37,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.10 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.0000, -4.0625, 0.6797, 1.4609, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8125, -2.3750, 1.8281, 1.3906, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3125, -2.5312, 0.4727, 2.6719, -1.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0625, -0.2578, 1.4297, -1.6406, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.0000, -5.0625, -2.6719, -2.8906, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.8125, -4.7812, 0.0796, 0.7188, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6562, -4.3125, -1.4922, 1.5859, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
[2025-11-06 18:31:39,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:31:39,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 221.26 | bwd_microstep: 1894.69 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1893.65 | step_microstep: 1.62 [2025-11-06 18:31:39,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.36 | bwd: 1895.41 | bwd_inner: 1.57 | bwd_allreduce: 1893.69 | step: 1.71 55%|█████▍ | 1917/3507 [46:53<40:11, 1.52s/it] {'loss': 0.7642, 'learning_rate': 8.981054090052847e-06, 'epoch': 0.55} 55%|█████▍ | 1917/3507 [46:53<40:11, 1.52s/it]tensor([[-5.6562, -3.7969, 1.2344, 2.6406, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8750, -1.2344, 2.3906, 1.2109, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:31:39,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.34 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.8750, -1.8516, 2.2031, -2.2656, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.9688, -4.0312, 1.5547, 0.5156, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7500, -3.4062, -0.2432, 0.4805, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.6250, -0.0386, 2.9531, 1.0547, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7500, -0.0083, 2.7969, -2.0312, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8125, -3.5469, 0.4473, 2.2656, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:31:39,908] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:31:39,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.80 | bwd_microstep: 151.08 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 150.00 | step_microstep: 1.68 [2025-11-06 18:31:39,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.90 | bwd: 152.04 | bwd_inner: 1.87 | bwd_allreduce: 150.04 | step: 1.76 55%|█████▍ | 1918/3507 [46:53<32:11, 1.22s/it] {'loss': 0.4512, 'learning_rate': 8.971865342614199e-06, 'epoch': 0.55} 55%|█████▍ | 1918/3507 [46:53<32:11, 1.22s/it]tensor([[-6.8438, -3.8438, 1.1797, -0.2891, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:31:39,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 56.57 | bwd_microstep: 0.64 | bwd_inner_microstep: 0.54 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.07 tensor([[-7.0625, -6.6875, -2.7500, 1.1562, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.0000, -4.6562, 0.2734, 0.2676, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.5938, -5.3750, -0.9648, 1.0156, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0938, -4.0000, -0.6016, 3.4844, -1.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2812, -1.4297, 2.4531, 0.7891, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6562, -0.8008, 2.5625, -1.6172, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5312, -0.7656, 1.1484, 0.8008, -1.7422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:31:41,855] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.19 | optimizer_step: 0.21 [2025-11-06 18:31:41,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 73.43 | bwd_microstep: 1794.90 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 1794.12 | step_microstep: 2.03 [2025-11-06 18:31:41,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 130.01 | bwd: 1795.55 | bwd_inner: 1.25 | bwd_allreduce: 1794.18 | step: 2.09 55%|█████▍ | 1919/3507 [46:55<37:58, 1.44s/it] {'loss': 0.2934, 'learning_rate': 8.962677472451956e-06, 'epoch': 0.55} 55%|█████▍ | 1919/3507 [46:55<37:58, 1.44s/it]tensor([[-5.5312, -1.7734, 3.1250, -0.1895, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:31:42,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.33 | bwd_microstep: 1.14 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0625, -2.3750, 0.1060, -2.0156, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.5625, -3.6719, 0.4316, 0.6484, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4688, -1.5000, 3.4375, -0.5039, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -4.1875, -0.5234, 3.6875, -1.6953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4375, -4.1562, -0.7695, 2.8125, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.4219, -3.2969, -2.3281, 2.0000, -0.2197]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-7.3438, -6.3125, -1.0391, 2.2188, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:31:42,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.64 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:31:42,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.25 | bwd_microstep: 139.30 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 138.32 | step_microstep: 2.61 [2025-11-06 18:31:42,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.59 | bwd: 140.44 | bwd_inner: 1.93 | bwd_allreduce: 138.36 | step: 2.68 55%|█████▍ | 1920/3507 [46:56<30:33, 1.16s/it] {'loss': 0.7253, 'learning_rate': 8.953490487405854e-06, 'epoch': 0.55} 55%|█████▍ | 1920/3507 [46:56<30:33, 1.16s/it]tensor([[-2.6562, 0.6680, 2.6875, -1.1875, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:31:42,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 64.78 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-1.5938, 2.2031, 2.9531, -2.7656, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.6250, -4.1250, 0.0840, 1.5547, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -3.7188, 0.0635, 3.0625, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0000, -0.2695, 3.6719, -2.7344, -5.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.6875, -3.0625, 2.1562, -0.6836, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5625, -2.2188, 1.0625, 0.0488, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2031, 0.7578, 4.1250, -0.9570, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:31:43,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.21 | 
[Rank 0 training log, steps 1921-1942 of 3507, epoch 0.55, elapsed 46:57 -> 47:29. Per-sample logit tensor dumps (truncated reprs), tqdm redraw duplicates, and per-microstep DeepSpeed fwd/bwd/step timing records are elided; the per-step trainer metrics below are preserved verbatim from the raw output.]

step 1921/3507 | loss 0.7945 | lr 8.944304395314868e-06 | 1.14s/it
step 1922/3507 | loss 0.5266 | lr 8.93511920401722e-06  | 1.67s/it
step 1923/3507 | loss 0.1454 | lr 8.925934921350356e-06 | 1.59s/it
step 1924/3507 | loss 0.2652 | lr 8.916751555150947e-06 | 1.45s/it
step 1925/3507 | loss 0.2227 | lr 8.907569113254877e-06 | 1.49s/it
step 1926/3507 | loss 0.3138 | lr 8.898387603497259e-06 | 1.52s/it
step 1927/3507 | loss 0.4142 | lr 8.889207033712391e-06 | 1.66s/it
step 1928/3507 | loss 0.2549 | lr 8.88002741173379e-06  | 1.74s/it
step 1929/3507 | loss 0.5425 | lr 8.870848745394131e-06 | 1.48s/it
step 1930/3507 | loss 0.3095 | lr 8.861671042525312e-06 | 1.98s/it
step 1931/3507 | loss 0.3617 | lr 8.852494310958379e-06 | 1.54s/it
step 1932/3507 | loss 0.2428 | lr 8.84331855852357e-06  | 1.83s/it
step 1933/3507 | loss 0.3335 | lr 8.834143793050275e-06 | 1.49s/it
step 1934/3507 | loss 0.5209 | lr 8.82497002236705e-06  | 1.70s/it
step 1935/3507 | loss 0.9873 | lr 8.81579725430159e-06  | 1.31s/it
step 1936/3507 | loss 0.4297 | lr 8.806625496680747e-06 | 1.82s/it
step 1937/3507 | loss 0.3849 | lr 8.797454757330504e-06 | 1.40s/it
step 1938/3507 | loss 0.3311 | lr 8.788285044075982e-06 | 1.14s/it
[h264 @ 0xd375500] mmco: unref short failure
step 1939/3507 | loss 0.1711 | lr 8.77911636474142e-06  | 1.67s/it
step 1940/3507 | loss 0.3717 | lr 8.769948727150172e-06 | 1.32s/it
step 1941/3507 | loss 0.5716 | lr 8.760782139124711e-06 | 1.09s/it
step 1942/3507 | loss 0.3381 | lr 8.75161660848661e-06  | 1.31s/it (log truncated mid-record)
1.31s/it]tensor([[-5.8750, -5.2188, -0.7969, 2.5469, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5938, -5.3750, -0.0598, 2.9375, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -4.9375, -1.5625, 1.8750, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1875, 0.0815, 2.6094, -3.2344, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:32:16,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.48 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.5781, 0.7188, 3.6875, -2.3438, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6250, -4.9062, -1.1094, 1.9141, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6562, -4.2188, -0.8359, 2.4219, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2031, -0.5781, 2.1406, -0.2207, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:32:17,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:32:17,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.20 | bwd_microstep: 442.42 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 441.46 | step_microstep: 1.73 [2025-11-06 18:32:17,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.67 | bwd: 443.32 | bwd_inner: 1.66 | bwd_allreduce: 441.51 | step: 1.82 55%|█████▌ | 1943/3507 [47:30<31:41, 1.22s/it] {'loss': 0.0657, 'learning_rate': 8.742452143056543e-06, 'epoch': 0.55} 55%|█████▌ | 1943/3507 [47:30<31:41, 1.22s/it]tensor([[-5.6250, -1.2969, 
3.4688, -1.6094, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:32:17,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.80 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.8750, -5.2500, -0.6289, 3.2344, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5938, -1.1953, 2.9219, 0.0776, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -2.1875, 1.8906, 0.7891, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5938, -2.9844, 2.0938, 1.5000, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8125, -1.6875, 3.1406, -1.4141, -5.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7188, 1.0859, 2.9062, -2.0156, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-9.1250, -7.1875, -4.0000, -3.2812, -6.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:32:18,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:32:18,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.77 | bwd_microstep: 942.91 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 941.70 | step_microstep: 2.06 [2025-11-06 18:32:18,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 282.59 | bwd: 943.92 | bwd_inner: 2.04 | bwd_allreduce: 941.75 | step: 2.14 55%|█████▌ | 1944/3507 [47:32<32:00, 1.23s/it] {'loss': 0.2583, 'learning_rate': 8.733288750654271e-06, 'epoch': 0.55} 55%|█████▌ | 1944/3507 [47:32<32:00, 1.23s/it]tensor([[-4.3438, -1.3359, 2.5938, 0.6211, -3.5625]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -4.7500, -2.0625, 1.7422, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:32:18,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.41 | bwd_microstep: 1.14 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-6.3750, -5.5000, -0.4961, 2.9375, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.5312, -2.9375, 2.9062, 0.3496, -5.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -5.3438, -2.8906, 1.8828, -1.9922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, -4.2188, -1.0625, 2.6250, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.8438, -3.6406, 2.1094, 0.2695, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5312, -4.0312, 0.0742, 1.5859, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:32:19,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.19 | optimizer_step: 0.22 [2025-11-06 18:32:19,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.94 | bwd_microstep: 933.65 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 932.37 | step_microstep: 2.83 [2025-11-06 18:32:19,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.37 | bwd: 934.80 | bwd_inner: 2.18 | bwd_allreduce: 932.43 | step: 2.94 55%|█████▌ | 1945/3507 [47:33<32:27, 1.25s/it] {'loss': 0.0912, 'learning_rate': 8.724126439098645e-06, 'epoch': 0.55} 55%|█████▌ | 1945/3507 [47:33<32:27, 1.25s/it]tensor([[-8.6250, -7.1875, -2.0156, 0.4336, -5.7188]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:32:19,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.80 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.9688, -4.0000, 1.5000, 0.3809, -5.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.6250, -3.5469, 1.8516, 0.2891, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0625, -1.0703, 2.9531, -1.5469, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6250, -3.5938, 0.1807, 2.2500, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6250, -4.0312, 0.0796, 3.5781, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1250, -3.1562, 0.9023, 3.3281, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5312, -3.9219, -0.1162, 3.1562, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:32:22,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.21 | optimizer_step: 0.32 [2025-11-06 18:32:22,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.72 | bwd_microstep: 2409.53 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 2408.59 | step_microstep: 2.42 [2025-11-06 18:32:22,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 312.54 | bwd: 2410.28 | bwd_inner: 1.49 | bwd_allreduce: 2408.63 | step: 2.51 55%|█████▌ | 1946/3507 [47:36<44:16, 1.70s/it] {'loss': 0.3709, 'learning_rate': 8.714965216207587e-06, 'epoch': 0.55} 55%|█████▌ | 1946/3507 [47:36<44:16, 1.70s/it]tensor([[-5.9688, -4.6250, 0.1709, 2.3125, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') tensor([[-5.0000, -5.0312, -1.8750, 2.5625, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6562, -1.1406, 3.0781, 0.0286, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:32:22,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.09 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.4688, -5.8750, -1.5234, 1.9375, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5000, -3.9062, 0.6328, -0.3652, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6719, -1.1484, 1.6406, -0.1152, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5625, -4.4688, -0.5078, 1.9062, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9219, 0.2793, 3.8750, -1.2188, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:32:22,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:32:22,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.53 | bwd_microstep: 10.29 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 9.14 | step_microstep: 1.40 [2025-11-06 18:32:22,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.33 | bwd: 11.26 | bwd_inner: 1.95 | bwd_allreduce: 9.18 | step: 1.48 56%|█████▌ | 1947/3507 [47:36<34:24, 1.32s/it] {'loss': 0.4941, 'learning_rate': 8.705805089798089e-06, 'epoch': 0.56} 56%|█████▌ | 1947/3507 [47:36<34:24, 1.32s/it]tensor([[-4.2500, -0.2109, 2.9219, -2.0469, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[ 
0.0869, -0.9883, -0.6211, 3.4688, 1.6484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:32:22,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.88 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.8281, -2.7344, 0.8555, 2.9375, -1.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1875, -4.7500, -2.0625, 2.8906, -1.4453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8125, -2.5469, 1.5234, 3.3281, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6562, -2.3438, 1.4141, 0.4609, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, -2.4531, 1.5469, 1.4531, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.8438, -5.3750, -0.9648, 0.8594, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:32:24,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.20 [2025-11-06 18:32:24,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.40 | bwd_microstep: 1266.43 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1265.20 | step_microstep: 2.14 [2025-11-06 18:32:24,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.30 | bwd: 1267.43 | bwd_inner: 2.05 | bwd_allreduce: 1265.24 | step: 2.22 56%|█████▌ | 1948/3507 [47:38<36:49, 1.42s/it] {'loss': 0.6682, 'learning_rate': 8.69664606768622e-06, 'epoch': 0.56} 56%|█████▌ | 1948/3507 [47:38<36:49, 1.42s/it]tensor([[-7.0000, -4.0625, 1.2891, 0.3320, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4531, 0.2129, 1.4141, -3.2344, 
-3.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1875, -4.5625, -1.6328, 1.1016, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.1250, -2.8281, 2.3750, 0.0986, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:32:24,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.93 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-7.2812, -6.4688, -1.8828, 1.4688, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6250, -3.0781, -0.7734, 3.5781, -0.3789]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3750, -3.2969, 0.9766, 1.4844, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.3281, -3.7031, -1.7031, 2.3281, -1.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:32:25,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:32:25,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.37 | bwd_microstep: 331.58 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 330.42 | step_microstep: 1.81 [2025-11-06 18:32:25,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.29 | bwd: 332.53 | bwd_inner: 1.89 | bwd_allreduce: 330.47 | step: 1.91 56%|█████▌ | 1949/3507 [47:39<31:48, 1.22s/it] {'loss': 0.2392, 'learning_rate': 8.68748815768709e-06, 'epoch': 0.56} 56%|█████▌ | 1949/3507 [47:39<31:48, 1.22s/it]tensor([[-3.4219, -3.5781, -0.0547, 4.6250, -0.8945]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:32:25,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 143.21 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.7500, -4.0938, 0.1113, 1.2969, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1562, -1.8203, 2.5156, -0.2070, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4688, -1.2109, 2.5781, -0.3320, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.9688, -1.3203, 2.5000, -1.2969, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5312, 0.9102, 4.1250, -1.7344, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3438, -3.4219, 0.7500, 1.2812, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4688, -4.6250, -1.0781, 1.3203, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:32:28,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.19 | optimizer_step: 0.21 [2025-11-06 18:32:28,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.78 | bwd_microstep: 3301.83 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 3300.68 | step_microstep: 2.11 [2025-11-06 18:32:28,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.02 | bwd: 3302.83 | bwd_inner: 1.95 | bwd_allreduce: 3300.73 | step: 2.20 56%|█████▌ | 1950/3507 [47:42<50:50, 1.96s/it] {'loss': 0.1378, 'learning_rate': 8.67833136761488e-06, 'epoch': 0.56} 56%|█████▌ | 1950/3507 [47:42<50:50, 1.96s/it]tensor([[-5.7500, -2.8750, 1.8594, 0.5508, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0625, -4.9062, -1.1953, 0.6445, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:3') tensor([[-3.5156, -0.2871, 2.4531, -0.6406, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:32:29,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.06 | bwd_microstep: 1.27 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-6.3750, -5.0938, 0.0391, 2.5781, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.9688, -5.0000, 1.0859, 0.1680, -6.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0781, 1.1328, 3.1875, -2.4375, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1875, 0.1328, 2.5000, -1.4531, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1562, -2.9844, 0.4688, 2.1875, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:32:29,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.27 | optimizer_step: 0.37 [2025-11-06 18:32:29,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.80 | bwd_microstep: 2.33 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1.03 | step_microstep: 2.20 [2025-11-06 18:32:29,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.89 | bwd: 3.60 | bwd_inner: 2.34 | bwd_allreduce: 1.08 | step: 2.31 56%|█████▌ | 1951/3507 [47:43<38:50, 1.50s/it] {'loss': 0.3325, 'learning_rate': 8.669175705282791e-06, 'epoch': 0.56} 56%|█████▌ | 1951/3507 [47:43<38:50, 1.50s/it]tensor([[-2.8438, 0.1494, 1.4531, -2.1875, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:32:29,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.49 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.82 | 
bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.5000, -1.3594, 1.9766, -0.8750, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.7812, -6.0000, -1.3281, 2.1719, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1562, -2.8281, -0.0781, 3.1094, -1.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.8125, -0.5039, 1.5234, -2.2344, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -3.3125, 1.1953, 3.1250, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4844, -2.1562, 0.7383, 1.5859, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5781, -1.0469, 2.6094, 1.5312, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:32:31,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.66 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 18:32:31,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 110.44 | bwd_microstep: 1669.60 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 1668.44 | step_microstep: 2.79 [2025-11-06 18:32:31,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 263.96 | bwd: 1670.55 | bwd_inner: 1.89 | bwd_allreduce: 1668.49 | step: 2.88 56%|█████▌ | 1952/3507 [47:45<42:29, 1.64s/it] {'loss': 0.5819, 'learning_rate': 8.660021178503082e-06, 'epoch': 0.56} 56%|█████▌ | 1952/3507 [47:45<42:29, 1.64s/it]tensor([[-4.4062, -5.0312, -3.1250, 1.4844, -1.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -1.3672, 2.1406, -1.5547, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-2.5938, -3.0625, -2.1094, 1.3125, -0.6875]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6250, -4.6562, -1.3438, 2.7812, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:32:31,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 213.02 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.6875, -3.9531, -1.4062, 2.5938, -1.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0625, -1.4922, 2.9062, -0.3789, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0938, -0.3145, 3.9375, -0.2324, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-9.3750, -6.8125, -1.0078, -0.9141, -6.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:32:31,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:32:31,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 210.10 | bwd_microstep: 64.17 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 63.29 | step_microstep: 1.95 [2025-11-06 18:32:31,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 423.16 | bwd: 64.88 | bwd_inner: 1.40 | bwd_allreduce: 63.33 | step: 2.04 56%|█████▌ | 1953/3507 [47:45<33:55, 1.31s/it] {'loss': 0.5737, 'learning_rate': 8.650867795087032e-06, 'epoch': 0.56} 56%|█████▌ | 1953/3507 [47:45<33:55, 1.31s/it]tensor([[-3.7500, -1.2109, 1.7656, -0.0092, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:32:32,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.37 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.3750, 
-3.9062, -0.6172, 2.4062, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4062, -3.7656, -1.6250, 2.5625, -1.0547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.6250, -6.0938, -1.0938, 0.9258, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4062, -4.0938, 1.1641, 1.6016, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.6562, -0.9570, 1.2891, 3.2344, -0.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5625, -1.7500, 1.3906, -0.0898, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)tensor([[-3.9531, -4.1250, -1.3906, 2.8750, -1.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([3], device='cuda:2') [2025-11-06 18:32:32,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.80 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 18:32:32,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.31 | bwd_microstep: 104.45 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 103.10 | step_microstep: 2.56 [2025-11-06 18:32:32,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 291.71 | bwd: 105.17 | bwd_inner: 1.86 | bwd_allreduce: 103.15 | step: 2.65 56%|█████▌ | 1954/3507 [47:46<27:07, 1.05s/it] {'loss': 0.4178, 'learning_rate': 8.641715562844952e-06, 'epoch': 0.56} 56%|█████▌ | 1954/3507 [47:46<27:07, 1.05s/it]tensor([[-3.6250, -4.0312, -1.4844, 3.0156, -1.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-2.6406, -0.2812, 1.7344, -0.1270, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9688, -2.4375, 2.5312, -0.5156, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:1') tensor([[-5.7812, -3.4062, 1.3359, 1.2266, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -3.3438, 0.4297, 1.0547, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.8125, -3.9375, 1.9219, 1.1328, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:32:33,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.72 | bwd_microstep: 14.62 | bwd_inner_microstep: 14.49 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.7031, -4.0312, -1.5312, 2.7031, -1.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -3.0000, 0.8984, 0.7500, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:32:33,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:32:33,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.42 | bwd_microstep: 1.65 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.78 | step_microstep: 2.13 [2025-11-06 18:32:33,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.18 | bwd: 16.27 | bwd_inner: 15.30 | bwd_allreduce: 0.82 | step: 2.22 56%|█████▌ | 1955/3507 [47:47<30:58, 1.20s/it] {'loss': 1.0417, 'learning_rate': 8.63256448958616e-06, 'epoch': 0.56} 56%|█████▌ | 1955/3507 [47:47<30:58, 1.20s/it]tensor([[-6.0938, -6.0312, -2.0312, 2.4062, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9062, -3.5469, -0.2695, 3.2500, -1.6797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8125, -3.7188, 0.3320, 0.3223, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:32:34,021] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.55 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.8828, -2.4375, -1.3828, 2.2344, -0.0674]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.8750, -4.2500, -0.0150, 1.1250, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5000, -3.0625, 0.7109, 1.8359, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.8750, -1.6719, 3.1094, -1.4531, -5.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3750, -0.2246, 3.6875, -1.4453, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:32:36,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:32:36,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.34 | bwd_microstep: 1870.69 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1869.62 | step_microstep: 2.07
[2025-11-06 18:32:36,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.92 | bwd: 1871.35 | bwd_inner: 1.53 | bwd_allreduce: 1869.67 | step: 2.15
56%|█████▌ | 1956/3507 [47:49<39:13, 1.52s/it] {'loss': 0.1922, 'learning_rate': 8.623414583119003e-06, 'epoch': 0.56}
tensor([[-5.0938, -1.2266, 3.1562, -0.4219, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.5156, 0.6094, 2.5312, -1.0312, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3125, -3.8906, 0.4180, 2.0781, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:32:36,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.93 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.8125, -2.1719, 2.5156, -0.5508, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.9844, -0.5820, 0.9727, -3.0938, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.7812, 1.0234, 4.5000, -2.2344, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2812, -3.8750, -0.5586, 3.0312, -1.9453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.6250, -2.8281, 2.2188, -1.3047, -5.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:32:37,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.18
[2025-11-06 18:32:37,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.19 | bwd_microstep: 863.33 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 862.36 | step_microstep: 1.91
[2025-11-06 18:32:37,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.13 | bwd: 864.04 | bwd_inner: 1.47 | bwd_allreduce: 862.41 | step: 1.99
56%|█████▌ | 1957/3507 [47:51<36:56, 1.43s/it] {'loss': 0.2904, 'learning_rate': 8.61426585125081e-06, 'epoch': 0.56}
tensor([[-3.1719, 0.9219, 3.6719, -1.5078, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4375, -3.5156, -0.5820, 3.5312, -1.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6562, -3.6406, 1.2500, 1.6016, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.2344, 1.2266, 3.9531, -0.1504, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0312, -1.3203, 3.1719, -0.5391, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:32:38,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.82 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.5625, -4.1562, 1.0234, 0.8594, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.5938, -5.4062, -1.3359, 2.8438, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6562, -4.0625, -0.8594, 1.8594, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:32:39,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.74 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:32:39,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 303.19 | bwd_microstep: 369.87 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 368.95 | step_microstep: 2.75
[2025-11-06 18:32:39,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 446.03 | bwd: 370.71 | bwd_inner: 1.55 | bwd_allreduce: 369.01 | step: 2.83
56%|█████▌ | 1958/3507 [47:52<39:36, 1.53s/it] {'loss': 0.2485, 'learning_rate': 8.605118301787925e-06, 'epoch': 0.56}
[h264 @ 0x9d49780] mmco: unref short failure
tensor([[-2.9844, 1.2422, 3.5312, -2.1562, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5625, -3.2188, 1.2188, 3.1719, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3125, -0.7305, 2.9062, -0.6172, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:32:39,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.53 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13
tensor([[-5.5000, -2.0156, 2.4219, -0.3008, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.8438, 2.5938, 2.4844, -2.1406, -1.8047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-0.6836, 2.0625, 2.6875, -0.3379, -1.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7500, -3.3438, -1.1797, 3.2656, -0.5039]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.7812, -4.3750, -0.3008, 1.3672, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:32:41,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.23 | optimizer_step: 0.33
[2025-11-06 18:32:41,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.52 | bwd_microstep: 2009.05 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 2007.98 | step_microstep: 2.48
[2025-11-06 18:32:41,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.07 | bwd: 2009.84 | bwd_inner: 1.61 | bwd_allreduce: 2008.05 | step: 2.61
56%|█████▌ | 1959/3507 [47:55<46:11, 1.79s/it] {'loss': 0.2191, 'learning_rate': 8.595971942535673e-06, 'epoch': 0.56}
tensor([[-3.7031, -4.3125, -2.5938, 1.8516, -1.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:32:41,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.68 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-1.4609, 0.8633, 3.2812, 1.2656, -1.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4375, -3.7344, -0.1006, 2.8750, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-8.5625, -6.2188, -0.9414, -0.5039, -6.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -2.5312, 1.5000, 1.6406, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.4688, -1.3828, 1.7656, -1.0859, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.9062, -0.6094, 3.4844, -1.5859, -4.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4219, -4.1562, -2.1250, 2.6406, -0.9609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:32:42,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:32:42,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.75 | bwd_microstep: 1001.53 | bwd_inner_microstep: 2.80 | bwd_allreduce_microstep: 998.63 | step_microstep: 2.38
[2025-11-06 18:32:42,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.45 | bwd: 1002.29 | bwd_inner: 3.47 | bwd_allreduce: 998.67 | step: 2.46
56%|█████▌ | 1960/3507 [47:56<43:00, 1.67s/it] {'loss': 0.2224, 'learning_rate': 8.586826781298373e-06, 'epoch': 0.56}
tensor([[-2.7500, -3.5781, -2.3438, 1.9609, -0.5078]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6250, -3.9375, 0.4883, 1.6797, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.1797, 2.9375, 4.4688, -1.7344, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:32:43,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.51 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.4688, 0.2012, 3.1562, -0.8906, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8750, -0.6602, 3.6406, -1.3438, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.6250, -2.0938, 2.5000, -0.3848, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.3125, -3.0156, 2.0625, -0.1826, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2500, -4.4375, -1.6719, 0.3047, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:32:46,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:32:46,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 283.91 | bwd_microstep: 3174.02 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 3172.89 | step_microstep: 1.90
[2025-11-06 18:32:46,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 481.45 | bwd: 3174.84 | bwd_inner: 1.75 | bwd_allreduce: 3172.95 | step: 1.99
56%|█████▌ | 1961/3507 [48:00<58:41, 2.28s/it] {'loss': 0.5558, 'learning_rate': 8.577682825879312e-06, 'epoch': 0.56}
tensor([[-4.8438, -2.7188, 1.5703, 1.1172, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.2188, -5.8125, -0.5781, 1.7812, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:32:46,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.53 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.3125, -4.3125, -0.3105, 2.4062, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.0000, -6.0625, -0.9531, 2.0938, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7812, -1.4688, 2.2344, -0.5938, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.5000, -5.1250, -2.3125, 2.9844, -1.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.4688, 1.4141, 2.7656, -2.5938, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6250, -0.5039, 3.5156, -1.5547, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:32:47,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:32:47,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.32 | bwd_microstep: 42.98 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 41.77 | step_microstep: 1.39
[2025-11-06 18:32:47,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.88 | bwd: 43.87 | bwd_inner: 1.92 | bwd_allreduce: 41.81 | step: 1.47
56%|█████▌ | 1962/3507 [48:00<44:33, 1.73s/it] {'loss': 0.1348, 'learning_rate': 8.568540084080755e-06, 'epoch': 0.56}
tensor([[1.1797, 3.2188, 4.9375, 3.2344, 0.8789]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.4375, -3.6719, 1.0938, 2.4531, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5156, -2.7656, 0.8555, 3.5781, -1.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:32:47,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.24 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.1562, -2.5469, -0.1611, 3.9219, -0.1338]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7656, -4.2188, -2.3594, 1.7500, -1.3672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4062, -4.5000, -0.5820, 2.0625, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.1562, -3.3281, 2.0938, 0.8672, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.9531, -4.4688, -2.4219, 2.0156, -1.5078]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:32:47,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:32:47,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 281.08 | bwd_microstep: 217.21 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 216.23 | step_microstep: 2.11
[2025-11-06 18:32:47,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 475.35 | bwd: 217.96 | bwd_inner: 1.48 | bwd_allreduce: 216.29 | step: 2.20
56%|█████▌ | 1963/3507 [48:01<36:52, 1.43s/it] {'loss': 0.8219, 'learning_rate': 8.559398563703924e-06, 'epoch': 0.56}
tensor([[-3.5938, -4.0000, -1.8047, 2.4375, -1.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2500, -0.7578, 3.1250, -0.2637, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:32:47,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 134.91 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.4375, -5.3438, -0.0806, 3.0312, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2812, -3.7969, -1.2578, 3.4219, -0.8086]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.6562, -3.7344, 0.4961, 3.1875, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8750, -0.6523, 3.2500, -1.9922, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.7812, -5.0625, -2.0312, 2.5000, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3125, -3.9062, 0.4570, 2.5469, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:32:48,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.20 | optimizer_step: 0.18
[2025-11-06 18:32:48,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.05 | bwd_microstep: 460.75 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 459.81 | step_microstep: 2.17
[2025-11-06 18:32:48,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 261.98 | bwd: 461.51 | bwd_inner: 1.47 | bwd_allreduce: 459.87 | step: 2.26
56%|█████▌ | 1964/3507 [48:02<31:48, 1.24s/it] {'loss': 0.0509, 'learning_rate': 8.55025827254901e-06, 'epoch': 0.56}
tensor([[-4.7812, -3.3125, -0.2637, 0.5664, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.2812, -3.1875, 1.1484, 1.5859, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:32:48,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.36 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-7.3125, -5.2500, 0.5977, 1.5000, -5.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.9375, -4.6875, -1.0859, 0.7969, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.3594, -1.7656, 2.5156, 3.4688, -1.8828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.4062, -4.6562, -0.9922, 1.6875, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8906, -1.0234, 2.5000, 0.3750, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.1875, -4.1562, -0.3555, -0.0962, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:32:50,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.19 | optimizer_step: 0.29
[2025-11-06 18:32:50,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.88 | bwd_microstep: 1893.02 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1891.71 | step_microstep: 2.18
[2025-11-06 18:32:50,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.24 | bwd: 1893.75 | bwd_inner: 1.83 | bwd_allreduce: 1891.76 | step: 2.26
56%|█████▌ | 1965/3507 [48:04<39:58, 1.56s/it] {'loss': 0.4921, 'learning_rate': 8.541119218415144e-06, 'epoch': 0.56}
tensor([[-2.2812, -0.5938, 2.6406, 3.0000, -1.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8594, -1.0938, 1.1797, -0.8242, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:32:51,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.77 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-0.7383, 1.8203, 1.9453, -1.1250, -1.2422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-5.0938, -4.5625, -0.2246, 3.5938, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.5000, -2.9531, 2.7500, 0.1982, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.6250, -4.4688, 1.4609, 0.1572, -5.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.6875, -3.7031, 0.2812, 0.4219, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5625, -1.4141, 2.3438, 0.0164, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:32:51,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:32:51,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 98.85 | bwd_microstep: 848.92 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 847.98 | step_microstep: 1.97
[2025-11-06 18:32:51,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.64 | bwd: 849.99 | bwd_inner: 1.80 | bwd_allreduce: 848.04 | step: 2.07
56%|█████▌ | 1966/3507 [48:05<36:59, 1.44s/it] {'loss': 0.3851, 'learning_rate': 8.531981409100409e-06, 'epoch': 0.56}
tensor([[-1.7266, 1.7578, 2.9219, -1.7188, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:32:52,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.49 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.13
tensor([[-5.1875, -4.0312, -0.1226, 2.1562, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2656, -3.6250, -1.4688, 2.5469, -1.0859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9844, -3.6250, -0.5039, 2.3594, -1.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.1562, -4.9688, -1.1016, 3.0938, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.5938, 2.1094, 3.3750, -1.8203, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-4.1250, -0.2129, 3.6406, -0.6992, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5000, -2.0469, 1.5391, 0.5742, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:32:54,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.24 | optimizer_step: 0.25
[2025-11-06 18:32:54,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.92 | bwd_microstep: 1923.82 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1922.70 | step_microstep: 2.62
[2025-11-06 18:32:54,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 407.44 | bwd: 1924.82 | bwd_inner: 1.89 | bwd_allreduce: 1922.77 | step: 2.75
56%|█████▌ | 1967/3507 [48:08<44:14, 1.72s/it] {'loss': 0.5909, 'learning_rate': 8.522844852401824e-06, 'epoch': 0.56}
tensor([[-4.9062, -3.7812, -0.2422, 2.1875, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:32:54,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.99 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-6.0625, -4.6250, 0.1113, 1.6953, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0938, -5.9688, -2.2656, 1.9766, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8438, -0.7812, 2.2969, 0.0304, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.4062, -4.7500, -0.7305, 2.5625, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.2812, -3.3750, 2.0625, 1.0781, -4.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3438, -2.4531, 2.6094, 1.3984, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5625, -3.2969, 1.8438, 2.1250, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:32:54,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:32:54,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.83 | bwd_microstep: 2.06 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.87 | step_microstep: 1.67
[2025-11-06 18:32:54,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 436.84 | bwd: 3.14 | bwd_inner: 2.05 | bwd_allreduce: 0.92 | step: 1.76
56%|█████▌ | 1968/3507 [48:08<34:39, 1.35s/it] {'loss': 0.3923, 'learning_rate': 8.513709556115335e-06, 'epoch': 0.56}
tensor([[-5.4688, -5.2188, -1.4688, 2.5781, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0625, -1.6484, 1.8516, 0.5430, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:32:55,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.78 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.3438, -4.2812, 0.6289, 3.9688, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.5938, -2.7031, 1.2812, 1.7656, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.4688, -5.8438, 0.2227, 2.7500, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9688, -4.1250, -0.4609, 2.1250, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7500, -5.3125, -2.8750, 1.9688, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -4.0938, -0.1216, 3.4844, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:32:57,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:32:57,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.85 | bwd_microstep: 2273.18 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 2272.04 | step_microstep: 2.28
[2025-11-06 18:32:57,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 274.64 | bwd: 2274.05 | bwd_inner: 1.81 | bwd_allreduce: 2272.08 | step: 2.37
56%|█████▌ | 1969/3507 [48:11<44:08, 1.72s/it] {'loss': 0.5473, 'learning_rate': 8.504575528035816e-06, 'epoch': 0.56}
tensor([[-6.5625, -6.1250, -2.6250, 1.0781, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.1250, -1.1641, 2.7500, 3.1719, -1.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:32:57,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.52 | bwd_microstep: 1.20 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-4.6875, -1.4609, 1.9609, -0.7500, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.7500, -3.4688, 0.1777, 1.7422, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.1719, -2.9844, 0.3965, 4.2812, -0.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6875, -3.2031, 0.5000, 3.6875, -1.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2812, -3.4688, 0.3320, 3.1094, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4531, -0.4551, 1.8828, -0.8086, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:32:57,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.20 | optimizer_step: 0.20
[2025-11-06 18:32:57,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.12 | bwd_microstep: 55.34 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 54.24 | step_microstep: 2.13
[2025-11-06 18:32:57,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.66 | bwd: 56.54 | bwd_inner: 2.06 | bwd_allreduce: 54.29 | step: 2.24
56%|█████▋ | 1970/3507 [48:11<34:14, 1.34s/it] {'loss': 0.392, 'learning_rate': 8.49544277595706e-06, 'epoch': 0.56}
tensor([[2.0625, 1.2500, 2.0625, 5.8750, 3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:32:58,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.85 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.4688, -3.1094, 0.3945, 1.5938, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5312, -4.6250, -1.5781, 2.6250, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.1719, -1.1797, 1.7891, 1.3203, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.1875, -3.2656, 1.1328, 2.0000, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3438, -2.8281, 1.3906, 2.8906, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.2812, -4.9688, -0.2930, 1.7109, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5938, -3.6875, 0.1089, 2.6250, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:33:00,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.18 | optimizer_step: 0.28
[2025-11-06 18:33:00,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.79 | bwd_microstep: 2085.54 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 2084.62 | step_microstep: 2.05
[2025-11-06 18:33:00,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 277.66 | bwd: 2086.26 | bwd_inner: 1.43 | bwd_allreduce: 2084.68 | step: 2.13
56%|█████▌ | 1971/3507 [48:14<42:21, 1.65s/it] {'loss': 0.4683, 'learning_rate': 8.486311307671773e-06, 'epoch': 0.56}
tensor([[-4.7812, -1.8203, 1.7109, -0.6133, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.3438, -2.8750, -2.3125, 0.9766, -0.4961]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6406, -2.2812, 0.7930, 1.8438, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:33:00,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.31 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.0938, -4.3438, 1.3125, 3.1875, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -0.1572, 2.5625, -3.4219, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4375, -1.8828, 2.0000, -1.4531, -4.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0938, -2.0781, 1.2891, 1.5156, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.0625, -4.0938, 0.9023, 1.7266, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:33:00,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:33:00,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.52 | bwd_microstep: 68.41 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 67.21 | step_microstep: 1.83
[2025-11-06 18:33:00,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.86 | bwd: 69.38 | bwd_inner: 1.99 | bwd_allreduce: 67.26 | step: 1.92
56%|█████▌ | 1972/3507 [48:14<33:19, 1.30s/it] {'loss': 0.4059, 'learning_rate': 8.477181130971559e-06, 'epoch': 0.56}
tensor([[-3.2812, -3.5000, -0.8281, 3.3125, -1.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.5000, -5.7188, -1.1016, 2.2031, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:33:00,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.70 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.9375, -4.5000, -0.8594, 2.5781, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.4062, -3.1406, -1.2422, 3.3438, -0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.8438, -2.7656, 1.6484, -0.4238, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.5625, -1.6172, 3.0312, -0.7461, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.0938, -2.2031, -0.4316, 2.8750, -0.3164]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.4688, -5.1562, -1.3750, 2.4062, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:33:02,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.17 | optimizer_step: 0.25
[2025-11-06 18:33:02,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.93 | bwd_microstep: 1230.37 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1229.18 | step_microstep: 3.77
[2025-11-06 18:33:02,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.66 | bwd: 1231.28 | bwd_inner: 1.94 | bwd_allreduce: 1229.22 | step: 3.85
56%|█████▋ | 1973/3507 [48:16<35:20, 1.38s/it] {'loss': 0.4669, 'learning_rate': 8.46805225364692e-06, 'epoch': 0.56}
tensor([[-4.0000, -3.4219, -0.0869, 2.6562, -1.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.8594, 0.8711, 2.9844, -1.8516, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:33:02,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 137.19 | bwd_microstep: 1.16 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11
tensor([[-5.0625, -4.9062, -1.6562, 2.0156, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6875, -3.7656, -0.4355, 3.9531, -1.1797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3750, -1.9531, 1.9375, 1.3047, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.7500, -5.0312, 0.2090, 1.8594, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.6250, -4.0625, 0.6758, 2.2344, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2812, -1.2812, 2.6562, 0.5312, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:33:03,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.24 | optimizer_step: 0.21
[2025-11-06 18:33:03,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.53 | bwd_microstep: 457.29 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 456.11 | step_microstep: 2.17
[2025-11-06 18:33:03,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.74 | bwd: 458.45 | bwd_inner: 2.09 | bwd_allreduce: 456.17 | step: 2.28
56%|█████▋ | 1974/3507 [48:16<30:52, 1.21s/it] {'loss': 0.1506, 'learning_rate': 8.458924683487257e-06, 'epoch': 0.56}
tensor([[-3.7500, -2.7969, 1.0469, 3.2031, -1.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.9141, 1.7109, 1.8203, -0.8789, -1.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.1250, 0.2793, 2.1250, -1.8047, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.9375, -5.1250, -0.2090, 3.2812, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:33:03,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.97 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.6562, 0.8125, 4.3125, -1.7422, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.9375, -4.5938, -0.3164, 1.5156, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2500, -0.6055, 3.2969, -0.3926, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.3438, -2.6094, 2.1250, 0.6484, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:33:05,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.19 | optimizer_step: 0.17
[2025-11-06 18:33:05,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.11 | bwd_microstep: 2239.84 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 2238.66 | step_microstep: 2.01
[2025-11-06 18:33:05,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 465.08 | bwd: 2240.81 | bwd_inner: 1.98 | bwd_allreduce: 2238.69 | step: 2.10
56%|█████▋ | 1975/3507 [48:19<42:40, 1.67s/it] {'loss': 0.3853, 'learning_rate': 8.44979842828085e-06, 'epoch': 0.56}
tensor([[-1.0234, 2.4531, 3.3438, -1.5312, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-4.1875, -3.8594, -0.9609, 2.0938, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:33:06,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.73 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.0938, -2.0469, 1.6641, -0.6094, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0938, -1.3125, 2.8750, -1.2266, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.3125, -5.9375, -3.0625, 2.1875, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5000, -3.6875, -0.6406, 3.4688, -1.1953]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3125, -3.2969, 0.8203, 1.0234, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.8125, -5.4375, 0.0237, 2.5156, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:33:06,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:33:06,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.49 | bwd_microstep: 66.00 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 64.91 | step_microstep: 1.81
[2025-11-06 18:33:06,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.23 | bwd: 66.75 | bwd_inner: 1.64 | bwd_allreduce: 64.95 | step: 1.89
56%|█████▋ | 1976/3507 [48:20<33:27, 1.31s/it] {'loss': 0.301, 'learning_rate': 8.440673495814862e-06, 'epoch': 0.56}
tensor([[-4.5000, -0.4375, 3.1875, -1.5859, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:33:06,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.78 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-6.1562, -4.2500, -0.2930, 0.2969, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4688, -2.5938, 1.6328, 0.3027, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.5000, -4.9062, 0.6094, 2.5312, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.8594, -0.5195, 3.0312, 2.2031, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.0938, -1.9766, 1.9297, 1.7422, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3438, -0.9062, 3.1719, -0.0275, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -3.3906, 0.3125, 1.3750, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:33:10,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.22 | optimizer_step: 0.32 [2025-11-06 18:33:10,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.13 | bwd_microstep: 3309.25 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 3308.35 | step_microstep: 122.96 [2025-11-06 18:33:10,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 309.92 | bwd: 3310.30 | bwd_inner: 1.71 | bwd_allreduce: 3308.42 | step: 123.07 56%|█████▋ | 1977/3507 [48:23<52:21, 2.05s/it] {'loss': 0.7295, 'learning_rate': 8.431549893875319e-06, 'epoch': 0.56} 56%|█████▋ | 1977/3507 [48:23<52:21, 2.05s/it]tensor([[-3.7812, -4.0625, -1.8984, 2.0156, -1.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0000, -1.7500, 1.5391, -3.6719, -5.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:10,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.19 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.1250, -2.3906, 2.4062, -1.0469, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2812, -4.1562, 0.2109, 2.6406, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.4688, -2.7969, -1.4453, 1.9844, -0.5586]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-2.4844, -3.1406, -0.9570, 3.8125, -0.1494]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -1.6484, 
2.3750, 0.1553, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.3125, -0.9688, 2.6094, -0.5547, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:33:10,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:33:10,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.95 | bwd_microstep: 167.86 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 166.68 | step_microstep: 1.57 [2025-11-06 18:33:10,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.17 | bwd: 168.87 | bwd_inner: 2.02 | bwd_allreduce: 166.73 | step: 1.65 56%|█████▋ | 1978/3507 [48:24<41:02, 1.61s/it] {'loss': 0.3899, 'learning_rate': 8.42242763024712e-06, 'epoch': 0.56} 56%|█████▋ | 1978/3507 [48:24<41:02, 1.61s/it]tensor([[-5.9375, -4.1875, 0.7031, 1.5938, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:10,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.77 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.4375, -3.2188, 1.1953, 1.2031, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3750, -3.4219, -1.2891, 2.1406, -1.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8438, -4.6250, -0.5078, 1.5703, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1562, 0.1309, 3.2969, 0.3516, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1562, -4.7812, -0.9102, 2.9844, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.6562, -4.3750, -1.3203, -2.1406, -5.2500]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1562, -4.4375, -0.5508, 2.6406, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:33:12,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:33:12,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.28 | bwd_microstep: 1764.89 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1763.82 | step_microstep: 1.94 [2025-11-06 18:33:12,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.08 | bwd: 1765.92 | bwd_inner: 1.93 | bwd_allreduce: 1763.86 | step: 2.03 56%|█████▋ | 1979/3507 [48:26<45:02, 1.77s/it] {'loss': 0.4428, 'learning_rate': 8.413306712714014e-06, 'epoch': 0.56} 56%|█████▋ | 1979/3507 [48:26<45:02, 1.77s/it]tensor([[-3.7344, -3.9219, -0.4414, 4.1250, -1.2109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5000, -5.4062, -0.9609, 1.4375, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8438, -3.9375, -0.0522, 2.6094, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7188, -4.1562, -2.5000, 1.2969, -1.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:33:13,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.77 | bwd_microstep: 1.36 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-3.8906, -1.9688, 2.0000, 1.9453, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5000, 1.0859, 3.6250, -3.2031, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9375, -2.4375, 1.0000, -0.2090, -3.9062]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1562, -4.2500, 0.4941, 1.3672, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:13,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:33:13,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.26 | bwd_microstep: 1.94 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.82 | step_microstep: 1.97 [2025-11-06 18:33:13,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 432.06 | bwd: 3.30 | bwd_inner: 2.30 | bwd_allreduce: 0.86 | step: 2.07 56%|█████▋ | 1980/3507 [48:27<35:07, 1.38s/it] {'loss': 0.8604, 'learning_rate': 8.40418714905861e-06, 'epoch': 0.56} 56%|█████▋ | 1980/3507 [48:27<35:07, 1.38s/it]tensor([[-4.5938, -1.0234, 2.7812, -0.7539, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:13,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.22 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0000, -3.9844, 0.1514, 2.5469, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4062, -4.2500, -1.1797, 2.6562, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1875, -3.9688, -0.9141, 2.4062, -1.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4688, -4.0625, 0.3574, 1.9062, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7812, -4.7500, -0.5508, 4.3125, -1.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2188, -2.7188, 1.0547, 1.9844, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-6.4688, -3.9531, 1.4453, 1.2031, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:33:16,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.20 | optimizer_step: 0.17 [2025-11-06 18:33:16,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.56 | bwd_microstep: 2504.67 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 2503.43 | step_microstep: 1.89 [2025-11-06 18:33:16,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.81 | bwd: 2505.70 | bwd_inner: 2.09 | bwd_allreduce: 2503.47 | step: 1.97 56%|█████▋ | 1981/3507 [48:30<46:42, 1.84s/it] {'loss': 0.4804, 'learning_rate': 8.395068947062354e-06, 'epoch': 0.56} 56%|█████▋ | 1981/3507 [48:30<46:42, 1.84s/it]tensor([[-5.6875, -3.7344, 0.9102, 1.6641, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3906, -1.4453, 2.3281, 2.7656, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2500, 1.2656, 2.5000, -1.9219, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6250, -2.8906, 0.3867, 2.6406, -1.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:16,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.35 | bwd_microstep: 5.81 | bwd_inner_microstep: 5.65 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-4.5625, -3.9219, -0.0457, 3.0156, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -4.4375, -0.3945, 3.2812, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1562, -4.9062, -0.8555, 3.1094, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8438, 
-4.1562, -0.1855, 2.8281, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:16,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.84 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:33:16,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.28 | bwd_microstep: 1.41 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.63 | step_microstep: 2.19 [2025-11-06 18:33:16,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.65 | bwd: 7.22 | bwd_inner: 6.38 | bwd_allreduce: 0.68 | step: 2.29 57%|█████▋ | 1982/3507 [48:30<36:05, 1.42s/it] {'loss': 0.3306, 'learning_rate': 8.385952114505537e-06, 'epoch': 0.57} 57%|█████▋ | 1982/3507 [48:30<36:05, 1.42s/it]tensor([[-4.9062, -2.7188, 1.3984, 0.9648, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-4.2812, -1.2031, 2.4531, 0.2275, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([3], device='cuda:2') tensor([[-3.4688, 0.4668, 2.2500, -2.6562, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, -3.7812, 0.2617, 3.1875, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:16,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.95 | bwd_microstep: 4.24 | bwd_inner_microstep: 4.09 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-4.0000, -0.0077, 4.1250, -0.5078, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3750, -4.3750, -1.4062, 2.5000, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.6875, -5.2500, 0.0967, 2.2812, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3438, -4.1562, -0.5273, 3.3281, -1.8984]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:33:18,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:33:18,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.01 | bwd_microstep: 1094.46 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1093.39 | step_microstep: 1.92 [2025-11-06 18:33:18,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.98 | bwd: 1098.69 | bwd_inner: 5.08 | bwd_allreduce: 1093.44 | step: 2.01 57%|█████▋ | 1983/3507 [48:32<36:42, 1.45s/it] {'loss': 0.4117, 'learning_rate': 8.37683665916728e-06, 'epoch': 0.57} 57%|█████▋ | 1983/3507 [48:32<36:42, 1.45s/it]tensor([[-4.9062, -3.0469, 1.5391, 2.2500, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2500, -0.2188, 3.5625, -1.2188, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:18,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.22 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.4062, -3.5469, -0.4082, 1.9375, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2656, -3.3438, -0.1523, 3.9219, -0.9648]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-8.0625, -6.3125, -1.5000, 0.0106, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5938, -0.7070, 3.2812, -0.7344, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0781, -2.8750, 0.4062, 3.9375, -0.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1562, -5.1875, -1.6406, 2.5625, -2.4844]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:33:18,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.27 [2025-11-06 18:33:18,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.81 | bwd_microstep: 329.39 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 328.35 | step_microstep: 2.00 [2025-11-06 18:33:18,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 415.06 | bwd: 330.19 | bwd_inner: 1.68 | bwd_allreduce: 328.39 | step: 2.07 57%|█████▋ | 1984/3507 [48:32<31:38, 1.25s/it] {'loss': 0.1979, 'learning_rate': 8.36772258882552e-06, 'epoch': 0.57} 57%|█████▋ | 1984/3507 [48:32<31:38, 1.25s/it]tensor([[ 0.1943, 3.1406, 3.2812, -0.9961, -0.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.5000, -0.1240, 2.5312, -1.5156, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:19,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.23 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.2500, -5.2188, -1.5938, 2.5625, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8438, -4.8438, -0.3379, 2.3125, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2188, -3.5781, 0.7461, 1.6250, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-8.4375, -6.7812, -0.7930, 1.4297, -5.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8125, -2.3438, 2.3281, -0.5078, -4.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6875, -2.7812, 2.0781, 0.8672, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:3') [2025-11-06 18:33:21,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 18:33:21,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.37 | bwd_microstep: 2490.25 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 2489.04 | step_microstep: 2.23 [2025-11-06 18:33:21,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 279.61 | bwd: 2491.08 | bwd_inner: 1.86 | bwd_allreduce: 2489.08 | step: 2.31 57%|█████▋ | 1985/3507 [48:35<43:26, 1.71s/it] {'loss': 0.2227, 'learning_rate': 8.358609911257023e-06, 'epoch': 0.57} 57%|█████▋ | 1985/3507 [48:35<43:26, 1.71s/it]tensor([[-5.0625, -3.8594, 0.3594, 2.5469, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5938, -2.7812, 1.5234, 2.2031, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:21,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.61 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.7500, -3.0156, -0.4648, 1.8516, -1.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -4.9375, -0.6562, 3.8750, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -3.0938, 0.6172, 3.0781, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.4688, -3.9688, 0.9766, 2.6406, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1250, -3.0156, 1.0547, 0.9688, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, -2.7500, 1.8828, 1.6016, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
[2025-11-06 18:33:22,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.24 | optimizer_step: 0.21 [2025-11-06 18:33:22,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.23 | bwd_microstep: 550.30 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 549.38 | step_microstep: 2.13 [2025-11-06 18:33:22,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.86 | bwd: 551.14 | bwd_inner: 1.53 | bwd_allreduce: 549.44 | step: 2.22 57%|█████▋ | 1986/3507 [48:36<37:36, 1.48s/it] {'loss': 0.3066, 'learning_rate': 8.349498634237366e-06, 'epoch': 0.57} 57%|█████▋ | 1986/3507 [48:36<37:36, 1.48s/it]tensor([[-4.7812, -3.4531, 0.8438, 2.6094, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6562, -3.0938, 1.1797, 2.5781, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7188, -4.8438, -0.8203, 1.7344, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:22,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.28 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-5.4062, -2.2969, 1.6953, -0.8828, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5000, -4.9062, -0.7188, 0.6602, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7812, -3.9531, -0.1406, 2.4531, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.8125, -4.6250, 0.0183, 2.5781, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1875, -2.1719, 3.1250, -0.8477, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:33:24,724] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.78 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:33:24,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.01 | bwd_microstep: 1586.68 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1585.55 | step_microstep: 2.37 [2025-11-06 18:33:24,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.31 | bwd: 1587.69 | bwd_inner: 1.90 | bwd_allreduce: 1585.62 | step: 2.48 57%|█████▋ | 1987/3507 [48:38<41:35, 1.64s/it] {'loss': 0.4425, 'learning_rate': 8.340388765540923e-06, 'epoch': 0.57} 57%|█████▋ | 1987/3507 [48:38<41:35, 1.64s/it]tensor([[-5.9062e+00, -3.1406e+00, 1.1953e+00, 4.6082e-03, -4.6250e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:24,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.26 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.6406, 0.6953, 4.8438, 1.3594, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2188, -0.9414, 2.0469, -1.1016, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8438, -2.2188, 3.2656, 0.1562, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.1094, 1.0859, 2.8594, -1.2344, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.5938, -3.9219, 0.4316, 1.4688, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.8906, -1.3828, 0.9023, 1.0469, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.4844, 1.2500, 3.6719, -1.1719, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:33:25,394] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:33:25,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.08 | bwd_microstep: 318.71 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 317.79 | step_microstep: 1.91 [2025-11-06 18:33:25,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 310.37 | bwd: 319.40 | bwd_inner: 1.44 | bwd_allreduce: 317.83 | step: 1.99 57%|█████▋ | 1988/3507 [48:39<34:10, 1.35s/it] {'loss': 0.5891, 'learning_rate': 8.331280312940872e-06, 'epoch': 0.57} 57%|█████▋ | 1988/3507 [48:39<34:10, 1.35s/it]tensor([[1.6953, 4.2188, 4.1875, 0.6523, 0.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-7.3125, -5.9062, -0.4980, 2.0625, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:25,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.18 | bwd_microstep: 2.20 | bwd_inner_microstep: 2.07 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-6.5312, -6.0625, -2.4375, 1.0625, -3.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)tensor([[-8.3750, -7.2188, -2.0781, 0.8711, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([3], device='cuda:1') tensor([[-1.4297, -2.0469, -0.7305, 3.1875, 0.4785]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3125, -2.5469, 0.5508, 2.6250, -1.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-9.0625, -7.0312, -2.1094, -1.3125, -6.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5781, -2.8438, 1.0703, 3.9688, -1.5391]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:33:27,133] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.19 | optimizer_step: 0.21 [2025-11-06 18:33:27,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.99 | bwd_microstep: 1339.46 | bwd_inner_microstep: 1.48 | bwd_allreduce_microstep: 1337.83 | step_microstep: 2.05 [2025-11-06 18:33:27,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.19 | bwd: 1341.66 | bwd_inner: 3.57 | bwd_allreduce: 1337.90 | step: 2.15 57%|█████▋ | 1989/3507 [48:40<37:07, 1.47s/it] {'loss': 0.1983, 'learning_rate': 8.322173284209187e-06, 'epoch': 0.57} 57%|█████▋ | 1989/3507 [48:40<37:07, 1.47s/it]tensor([[-4.5625, -3.0781, 1.2109, 2.6406, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7188, -3.8594, -0.1118, 2.4844, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.3125, -6.9688, -1.2109, 1.4844, -5.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:27,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 269.32 | bwd_microstep: 1.54 | bwd_inner_microstep: 1.34 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.16 tensor([[-3.6875, -4.2500, -2.2812, 2.0469, -1.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -2.4531, 2.0938, 1.5000, -3.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.9609, -2.7344, -2.4219, 1.0000, -0.1318]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-5.2500, -3.5156, 0.8945, 1.6016, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4844, -1.2031, 0.8984, -0.3223, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:33:28,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 
| optimizer_gradients: 0.25 | optimizer_step: 0.33 [2025-11-06 18:33:28,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.56 | bwd_microstep: 1171.22 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 1170.33 | step_microstep: 2.70 [2025-11-06 18:33:28,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 429.92 | bwd: 1172.79 | bwd_inner: 2.14 | bwd_allreduce: 1170.43 | step: 2.87 57%|█████▋ | 1990/3507 [48:42<38:33, 1.52s/it] {'loss': 0.3935, 'learning_rate': 8.313067687116618e-06, 'epoch': 0.57} 57%|█████▋ | 1990/3507 [48:42<38:33, 1.52s/it]tensor([[-4.7812, -2.5938, 1.2344, 0.8672, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1250, -3.7656, -1.7109, 2.6562, -0.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.6250, -0.8828, 2.4062, -1.6797, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -0.1660, 4.0000, -1.9609, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:29,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 311.61 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.2500, -1.9609, 2.5938, 0.2617, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1875, -5.1250, -1.6719, 2.4688, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3750, -3.1250, 0.7188, 2.6406, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -6.0312, -3.8750, 0.6133, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') [2025-11-06 18:33:30,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.08 | optimizer_gradients: 0.63 | 
optimizer_step: 0.27 [2025-11-06 18:33:30,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.07 | bwd_microstep: 1704.20 | bwd_inner_microstep: 1.41 | bwd_allreduce_microstep: 1702.62 | step_microstep: 7.02 [2025-11-06 18:33:30,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 452.72 | bwd: 1704.98 | bwd_inner: 2.11 | bwd_allreduce: 1702.66 | step: 7.11 57%|█████▋ | 1991/3507 [48:44<43:41, 1.73s/it] {'loss': 1.0187, 'learning_rate': 8.303963529432695e-06, 'epoch': 0.57} 57%|█████▋ | 1991/3507 [48:44<43:41, 1.73s/it]tensor([[-5.7188, -5.3750, -1.5469, 1.8828, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:31,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.43 | bwd_microstep: 7.29 | bwd_inner_microstep: 7.14 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-4.7188, -2.9531, 1.0625, 1.5938, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3750, -1.8125, 2.9531, -0.2539, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7656, -3.6406, -2.3750, 2.0781, -0.4707]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-3.3125, 0.5781, 2.4688, -2.7969, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.4062, -3.9844, -0.3496, 3.0156, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0938, -4.7812, -1.2656, 2.1562, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9062, -3.4531, 1.0234, 0.5547, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:33:33,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.05 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 
18:33:33,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.76 | bwd_microstep: 2315.76 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 2314.96 | step_microstep: 2.68 [2025-11-06 18:33:33,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.21 | bwd: 2323.05 | bwd_inner: 7.88 | bwd_allreduce: 2315.02 | step: 2.79 57%|█████▋ | 1992/3507 [48:47<51:14, 2.03s/it] {'loss': 0.7953, 'learning_rate': 8.294860818925726e-06, 'epoch': 0.57} 57%|█████▋ | 1992/3507 [48:47<51:14, 2.03s/it]tensor([[-3.1094, 0.1147, 3.0781, -0.0635, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7969, -2.5938, 0.6602, 1.8750, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6406, -0.3027, 2.1406, -1.3672, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:33,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.81 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0938, -1.2969, 1.0000, -3.2344, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.0000, -4.6250, 1.5547, -0.2236, -6.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3750, -4.0000, 0.1943, 1.9062, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3750, -2.9375, 1.4688, 0.9648, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6562, -4.3750, -0.7930, 2.6875, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:33:34,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:33:34,193] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.35 | bwd_microstep: 51.91 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 50.80 | step_microstep: 2.15 [2025-11-06 18:33:34,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.18 | bwd: 52.90 | bwd_inner: 1.95 | bwd_allreduce: 50.83 | step: 2.23 57%|█████▋ | 1993/3507 [48:48<39:20, 1.56s/it] {'loss': 0.1827, 'learning_rate': 8.285759563362778e-06, 'epoch': 0.57} 57%|█████▋ | 1993/3507 [48:48<39:20, 1.56s/it]tensor([[-3.8438, -1.0625, 1.8047, -0.4512, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.8438, -4.0938, -0.1064, 2.8125, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5938, -4.2812, 1.4453, 1.7109, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0625, 1.0703, 4.6875, -0.3613, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:34,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.68 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.7188, 0.7852, 3.2188, -0.5508, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8438, -2.7344, 1.5469, 1.8281, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.1406, 1.1562, 3.8438, -2.0312, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0625, 1.0625, 4.2188, -1.2578, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:33:37,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.23 | optimizer_step: 0.32 [2025-11-06 18:33:37,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 150.88 | bwd_microstep: 2811.97 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 2810.76 | step_microstep: 59.05 [2025-11-06 18:33:37,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 482.59 | bwd: 2812.90 | bwd_inner: 1.95 | bwd_allreduce: 2810.81 | step: 59.13 57%|█████▋ | 1994/3507 [48:51<53:13, 2.11s/it] {'loss': 0.5949, 'learning_rate': 8.276659770509685e-06, 'epoch': 0.57} 57%|█████▋ | 1994/3507 [48:51<53:13, 2.11s/it]tensor([[-6.5312, -2.5938, 2.4062, -1.5156, -5.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-9.0625, -8.7500, -5.0000, -1.0078, -5.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:37,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.10 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.5469, -2.2031, 1.5391, 3.2500, -1.8359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2344, -0.9609, 1.1875, -0.4609, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1250, -3.8906, 0.2891, 4.1875, -1.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -2.6719, 1.4766, 1.1875, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2344, -0.6758, 1.7422, 0.2061, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0625, -2.7969, 1.5156, 1.2734, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:33:37,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:33:37,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.48 | 
bwd_microstep: 53.98 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 53.15 | step_microstep: 2.22 [2025-11-06 18:33:37,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 319.61 | bwd: 54.78 | bwd_inner: 1.43 | bwd_allreduce: 53.19 | step: 2.30 57%|█████▋ | 1995/3507 [48:51<40:19, 1.60s/it] {'loss': 0.4719, 'learning_rate': 8.267561448131016e-06, 'epoch': 0.57} 57%|█████▋ | 1995/3507 [48:51<40:19, 1.60s/it]tensor([[-6.0312, -1.5547, 3.2188, -2.0781, -5.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:38,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.32 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.3750, -3.4688, 1.1641, 1.7891, -3.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2969, -1.3359, 2.4531, 2.3750, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9062, -0.3125, 1.7422, -2.6250, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.8125e+00, -5.4375e+00, 2.1667e-03, 2.3281e+00, -4.2188e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3125, -1.6953, 2.0156, 0.4434, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.5312, -4.0625, 0.4258, -0.1050, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3281, 1.0469, 4.1875, -1.7891, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:33:38,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:33:38,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.74 | bwd_microstep: 163.64 | 
bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 162.65 | step_microstep: 1.82 [2025-11-06 18:33:38,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.07 | bwd: 164.51 | bwd_inner: 1.67 | bwd_allreduce: 162.70 | step: 1.90 57%|█████▋ | 1996/3507 [48:52<32:30, 1.29s/it] {'loss': 0.5936, 'learning_rate': 8.258464603990103e-06, 'epoch': 0.57} 57%|█████▋ | 1996/3507 [48:52<32:30, 1.29s/it]tensor([[-4.4062, -1.1953, 1.6641, -1.4219, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5312, -3.2656, 1.6562, 1.8203, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6250, -6.1250, -3.6094, 0.9688, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:38,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.44 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-4.9062, -3.0000, 1.4531, 2.3281, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8750, -0.3477, 2.9062, -0.7148, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6250, -2.4531, -0.0894, 2.9531, -0.7578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6719, -3.0156, -0.8242, 3.1719, -0.5117]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -4.6875, -0.9297, 2.9844, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:39,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:33:39,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 213.97 | bwd_microstep: 1.49 | bwd_inner_microstep: 0.79 | 
bwd_allreduce_microstep: 0.64 | step_microstep: 1.36 [2025-11-06 18:33:39,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 417.44 | bwd: 2.41 | bwd_inner: 1.59 | bwd_allreduce: 0.68 | step: 1.46 57%|█████▋ | 1997/3507 [48:52<26:11, 1.04s/it] {'loss': 0.2959, 'learning_rate': 8.249369245849007e-06, 'epoch': 0.57} 57%|█████▋ | 1997/3507 [48:52<26:11, 1.04s/it]tensor([[-3.4375, -2.2656, 0.7578, 2.0000, -1.9609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3750, 0.8750, 3.8906, -1.6953, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:39,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.98 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.7188, -3.1875, 1.9922, 1.3047, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6875, -6.6562, -2.8750, 1.4219, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.6562, -4.8438, 0.7344, 2.2188, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4062, -2.3281, 2.0469, 2.2500, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2812, -4.1562, 0.0442, 0.0226, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5312, -0.9648, 3.2188, -0.0933, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:33:40,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 18:33:40,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.98 | bwd_microstep: 1367.12 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 1365.81 | 
step_microstep: 1.98 [2025-11-06 18:33:40,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.99 | bwd: 1368.14 | bwd_inner: 2.16 | bwd_allreduce: 1365.85 | step: 2.07 57%|█████▋ | 1998/3507 [48:54<31:37, 1.26s/it] {'loss': 0.3799, 'learning_rate': 8.240275381468528e-06, 'epoch': 0.57} 57%|█████▋ | 1998/3507 [48:54<31:37, 1.26s/it]tensor([[-3.1250, 0.6211, 3.4688, -1.2188, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:40,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.41 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.3125, -5.8438, -0.9805, 0.8828, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0625, 0.0496, 3.0312, -2.2812, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[1.4375, 1.9609, 4.4688, 6.7188, 2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1562, -0.8320, 1.5625, -0.3613, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2812, 0.5938, 2.2500, -0.4766, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-5.6562, -4.7500, -0.1963, 2.1562, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3750, 1.4297, 3.2812, -1.6328, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:33:42,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:33:42,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 52.03 | bwd_microstep: 306.89 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 305.89 | step_microstep: 2.08 [2025-11-06 
18:33:42,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 185.45 | bwd: 307.86 | bwd_inner: 1.80 | bwd_allreduce: 305.93 | step: 2.16 57%|█████▋ | 1999/3507 [48:56<32:58, 1.31s/it] {'loss': 0.6286, 'learning_rate': 8.231183018608184e-06, 'epoch': 0.57} 57%|█████▋ | 1999/3507 [48:56<32:58, 1.31s/it]tensor([[-4.4375, -3.9375, -0.3613, 2.6719, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.4844, 1.0156, 2.3281, -2.0312, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9219, 0.0199, 2.2500, -0.5703, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:42,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.14 | bwd_microstep: 1.32 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-5.3125, -4.6875, -0.3887, 2.8594, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.8750, -6.2500, -0.4922, 1.6562, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.5781, 2.0000, 4.0625, -0.5469, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.6562, -2.9531, 1.1484, 1.8516, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9688, -4.6250, -0.7773, 2.9062, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:33:43,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:33:43,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.90 | bwd_microstep: 1094.84 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1093.72 | step_microstep: 2.02 [2025-11-06 18:33:43,807] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 442.06 | bwd: 1096.16 | bwd_inner: 2.21 | bwd_allreduce: 1093.78 | step: 2.13 57%|█████▋ | 2000/3507 [48:57<34:58, 1.39s/it] {'loss': 0.41, 'learning_rate': 8.222092165026218e-06, 'epoch': 0.57} 57%|█████▋ | 2000/3507 [48:57<34:58, 1.39s/it]tensor([[-5.7812, -1.4453, 3.2812, -1.6172, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.3125, -3.2500, 2.0312, 0.2539, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:43,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.47 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.6250, -2.5312, 1.3984, 1.3594, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8438, -4.1875, 0.0742, 3.4062, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.8125, -5.0625, 0.4199, 2.1250, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0000, -4.3750, -0.6523, 1.9531, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.3516, 2.4219, 3.0000, -2.6250, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3750e+00, -4.4062e+00, -1.6251e-03, 2.5938e+00, -3.0156e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:33:45,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.27 | optimizer_step: 0.26 [2025-11-06 18:33:45,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.78 | bwd_microstep: 114.29 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 113.41 | step_microstep: 2.81 [2025-11-06 18:33:45,063] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 397.24 | bwd: 115.25 | bwd_inner: 1.65 | bwd_allreduce: 113.46 | step: 2.88 57%|█████▋ | 2001/3507 [48:58<33:56, 1.35s/it] {'loss': 0.5375, 'learning_rate': 8.213002828479574e-06, 'epoch': 0.57} 57%|█████▋ | 2001/3507 [48:58<33:56, 1.35s/it]tensor([[-3.7969, -0.8945, 1.6250, -1.0000, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6406, -2.9219, -1.7031, 1.4453, -0.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:45,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.82 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.1250, -4.0312, 0.2598, 2.5000, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2500, -2.7500, 2.0469, -0.6562, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.1406, -3.5469, -1.8281, 1.9062, -0.9609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7500, -3.9062, -2.0000, 1.2891, -1.5859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0781, -1.0469, 2.4531, 1.7734, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.4062, -5.8438, -1.1875, 2.6094, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:33:46,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:33:46,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.37 | bwd_microstep: 1039.05 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 1038.04 | step_microstep: 1.63 [2025-11-06 18:33:46,449] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 309.22 | bwd: 1039.80 | bwd_inner: 1.60 | bwd_allreduce: 1038.07 | step: 1.70 57%|█████▋ | 2002/3507 [49:00<34:08, 1.36s/it] {'loss': 0.1438, 'learning_rate': 8.203915016723919e-06, 'epoch': 0.57} 57%|█████▋ | 2002/3507 [49:00<34:08, 1.36s/it]tensor([[-5.3750, -4.0938, 0.5156, 2.5625, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:46,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.87 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.7500, -2.4844, 1.4219, 0.5977, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.7812, -6.9062, -2.5938, 0.1465, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -2.4844, 1.6094, 0.7227, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.9844, 2.1875, 2.2031, -1.9062, -1.7578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-7.5312, -5.5625, 0.2637, 1.4375, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -4.4375, -0.3164, 2.4062, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4375, -2.6094, 1.5703, -0.0063, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:48,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.16 | optimizer_step: 0.23 [2025-11-06 18:33:48,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.67 | bwd_microstep: 1.90 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.95 | step_microstep: 2.35 [2025-11-06 18:33:48,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.54 | bwd: 2.99 | 
bwd_inner: 1.85 | bwd_allreduce: 0.99 | step: 2.43 57%|█████▋ | 2003/3507 [49:02<40:01, 1.60s/it] {'loss': 0.8353, 'learning_rate': 8.194828737513606e-06, 'epoch': 0.57} 57%|█████▋ | 2003/3507 [49:02<40:01, 1.60s/it]tensor([[-4.2188, -0.2148, 3.6406, -0.9414, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8594, -0.2188, 2.2500, -1.8281, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3438, -0.5234, 2.0312, -2.5938, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.1562, 1.5938, 4.6250, 0.0977, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:48,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.05 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0625, -4.0625, -0.6289, 3.5625, -1.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0938, -4.4062, 0.9727, 2.4531, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.6250, 1.8438, 2.7500, -1.6562, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.6562, -4.9062, -1.4766, 3.2500, -1.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:33:49,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:33:49,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.36 | bwd_microstep: 12.91 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 12.08 | step_microstep: 2.09 [2025-11-06 18:33:49,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.44 | bwd: 13.58 | bwd_inner: 1.32 | bwd_allreduce: 12.12 | 
step: 2.17 57%|█████▋ | 2004/3507 [49:02<31:18, 1.25s/it] {'loss': 0.2235, 'learning_rate': 8.185743998601681e-06, 'epoch': 0.57} 57%|█████▋ | 2004/3507 [49:02<31:18, 1.25s/it]tensor([[-1.5547, -0.3496, 1.9453, 2.6406, -0.6289]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:49,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.83 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-1.6406, 2.2656, 3.3281, -2.2500, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.2266, 2.3594, 2.6875, -2.6562, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1250, -4.8750, -1.1719, 2.3125, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5000, -2.0156, 2.9531, 0.0305, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9844, -3.2656, 0.2295, 2.6250, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9062, -3.1562, 1.8906, 0.9766, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.2188, -4.2500, 1.5156, 0.3281, -5.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:33:52,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.22 | optimizer_step: 0.24 [2025-11-06 18:33:52,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.61 | bwd_microstep: 3303.04 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 3301.90 | step_microstep: 2.55 [2025-11-06 18:33:52,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.46 | bwd: 3304.06 | bwd_inner: 1.93 | bwd_allreduce: 3301.96 | step: 2.64 57%|█████▋ | 2005/3507 
[49:06<49:49, 1.99s/it] {'loss': 0.2688, 'learning_rate': 8.176660807739886e-06, 'epoch': 0.57} 57%|█████▋ | 2005/3507 [49:06<49:49, 1.99s/it]tensor([[-2.5312, 1.7500, 3.5000, -3.0156, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:33:52,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 72.44 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.6719, -0.7031, 3.5938, 1.3047, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1875, -4.6562, -1.7656, 2.9688, -1.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7188, -4.9688, -0.7383, 2.2188, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -2.6562, 2.2344, 1.3828, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.5625, -5.4688, -1.4531, 0.5977, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8438, -3.0781, 0.3770, 2.5000, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4375, -3.6406, 0.2539, 2.6719, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:33:53,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.17 [2025-11-06 18:33:53,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.29 | bwd_microstep: 149.86 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 148.84 | step_microstep: 2.11 [2025-11-06 18:33:53,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 280.75 | bwd: 150.56 | bwd_inner: 1.52 | bwd_allreduce: 148.88 | step: 2.19 57%|█████▋ | 2006/3507 [49:07<38:21, 1.53s/it] {'loss': 0.3904, 
'learning_rate': 8.16757917267863e-06, 'epoch': 0.57} 57%|█████▋ | 2006/3507 [49:07<38:21, 1.53s/it]tensor([[-5.5000, -1.9453, 2.4219, -0.7031, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.3750, -6.8125, -1.7500, 2.3125, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:33:53,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.64 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.3750, -2.3438, 1.8359, 1.9688, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5625, -5.1562, 0.4473, 2.7500, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.2344, -3.4062, -0.6445, 3.3594, -0.9336]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7812, -2.4531, 1.5938, 1.0859, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7500, -1.7344, 3.3438, -0.5625, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2500, -1.0547, 3.7031, -1.2578, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:33:55,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.18 [2025-11-06 18:33:55,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 229.17 | bwd_microstep: 2015.71 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 2014.60 | step_microstep: 2.05 [2025-11-06 18:33:55,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.83 | bwd: 2016.53 | bwd_inner: 1.70 | bwd_allreduce: 2014.65 | step: 2.15 57%|█████▋ | 2007/3507 [49:09<45:00, 1.80s/it] {'loss': 0.2522, 'learning_rate': 
8.158499101166997e-06, 'epoch': 0.57}
 57%|█████▋ | 2007/3507 [49:09<45:00, 1.80s/it]
tensor([[-4.6250, -4.0625, -0.2285, 2.9688, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.6094, -0.0261, 2.4531, -1.7656, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:33:55,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.40 | bwd_microstep: 2.97 | bwd_inner_microstep: 2.81 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-1.9844, -2.8125, -1.9141, 2.1250, 0.0527]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.0938, -4.1250, -0.7461, 3.3438, -1.6016]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.8438, -3.6875, 0.3281, 2.0938, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.3594, -2.5625, 1.1797, 3.7031, -1.4453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-7.4062, -4.7188, 0.8945, 0.3711, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.5312, -1.8906, 3.2188, 0.0369, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:33:56,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:33:56,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.54 | bwd_microstep: 130.44 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 129.55 | step_microstep: 2.04
[2025-11-06 18:33:56,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 277.94 | bwd: 133.42 | bwd_inner: 3.64 | bwd_allreduce: 129.61 | step: 2.15
 57%|█████▋ | 2008/3507 [49:09<34:51, 1.40s/it] {'loss': 0.4221, 'learning_rate': 8.149420600952744e-06, 'epoch': 0.57}
tensor([[-4.5938, -3.6406, -0.2617, 2.0156, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.8906, -3.0625, 0.5938, 2.9844, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:33:56,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.59 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.4844, -1.0469, 2.1250, 0.8594, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.4062, -0.4277, 3.5000, -1.0859, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-7.7500, -4.5000, 1.2891, -0.2266, -6.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.4844, 0.4102, 2.2969, -0.7773, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[ 0.2949, 3.7969, 3.8750, -1.3125, -1.0234]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
tensor([[-2.6406, -3.3281, -2.0156, 2.2031, -0.4043]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:33:57,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.22 | optimizer_step: 0.32
[2025-11-06 18:33:57,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.82 | bwd_microstep: 1205.01 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1203.83 | step_microstep: 2.83
[2025-11-06 18:33:57,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.44 | bwd: 1205.67 | bwd_inner: 1.63 | bwd_allreduce: 1203.89 | step: 2.91
 57%|█████▋ | 2009/3507 [49:11<36:48, 1.47s/it] {'loss': 0.6468, 'learning_rate': 8.14034367978228e-06, 'epoch': 0.57}
tensor([[-4.5625, -4.2812, -0.7539, 2.7969, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.7812, -4.1250, -0.2930, 2.5938, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:33:57,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.40 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.6875, -4.7188, -0.4336, 2.1406, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.5000, -4.3438, -0.9141, 3.0000, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.6250, -4.9375, -0.3867, 2.8750, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-1.0703, -1.7734, -1.0859, 2.4844, 0.6836]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-1.0547, -1.8828, -1.3906, 2.3125, 0.6992]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.0938, -2.6875, 1.2031, 2.5625, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:33:59,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:33:59,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.35 | bwd_microstep: 804.71 | bwd_inner_microstep: 11.47 | bwd_allreduce_microstep: 793.16 | step_microstep: 1.98
[2025-11-06 18:33:59,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.77 | bwd: 805.47 | bwd_inner: 12.13 | bwd_allreduce: 793.20 | step: 2.06
 57%|█████▋ | 2010/3507 [49:12<35:08, 1.41s/it] {'loss': 0.4395, 'learning_rate': 8.13126834540067e-06, 'epoch': 0.57}
tensor([[-6.4375, -5.7812, -1.2500, 2.2656, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:33:59,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.36 | bwd_microstep: 1.34 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.1875, -4.6875, 0.4199, 2.1562, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.5938, -3.0625, 1.1328, 2.4688, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.5000, -5.0000, -1.1172, 2.3281, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.4375, -3.4531, 1.0312, 1.2109, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.6250, -2.3438, 1.3750, -1.4062, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.9688, 0.8047, 3.6562, -0.8750, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.0938, -3.2031, 0.9297, 3.2500, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:34:01,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 18:34:01,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.33 | bwd_microstep: 1818.96 | bwd_inner_microstep: 1.41 | bwd_allreduce_microstep: 1817.46 | step_microstep: 2.05
[2025-11-06 18:34:01,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 388.72 | bwd: 1820.29 | bwd_inner: 2.66 | bwd_allreduce: 1817.50 | step: 2.13
 57%|█████▋ | 2011/3507 [49:15<41:23, 1.66s/it] {'loss': 0.4075, 'learning_rate': 8.122194605551625e-06, 'epoch': 0.57}
tensor([[-3.8594, -2.9531, 0.2402, 2.1094, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.6094, -2.8594, -1.3516, 1.8906, -0.7461]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.0000, -2.3750, 1.2656, -0.1514, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:34:01,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.43 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.4375, 1.7891, 2.5781, -1.4609, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-6.5938, -6.0625, -1.1797, 2.7344, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.9688, -3.2031, 1.7734, 0.8906, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.2812, -3.0938, 0.0383, 3.3594, -1.2109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.2500, -3.5156, 0.4883, 1.3672, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:34:02,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:34:02,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 67.33 | bwd_microstep: 592.69 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 591.69 | step_microstep: 1.48
[2025-11-06 18:34:02,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 193.78 | bwd: 593.66 | bwd_inner: 1.81 | bwd_allreduce: 591.73 | step: 1.56
 57%|█████▋ | 2012/3507 [49:15<35:01, 1.41s/it] {'loss': 0.3251, 'learning_rate': 8.113122467977491e-06, 'epoch': 0.57}
tensor([[-5.5312, -4.5312, -0.3164, 1.7891, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.9531, -2.6719, 0.9609, 2.6250, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:34:02,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.23 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-6.2188, -5.5312, -1.3828, 1.7344, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.2344, -3.8750, -0.8672, 4.5625, -0.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-7.6875, -4.7188, 1.4219, 0.2988, -5.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.7500, 0.3105, 2.6719, -2.5625, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-5.4375, -4.3750, -0.4570, 1.5859, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.9531, 0.6602, 3.6094, -2.7656, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:34:03,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.19 | optimizer_step: 0.20
[2025-11-06 18:34:03,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.04 | bwd_microstep: 1103.41 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 1102.36 | step_microstep: 2.21
[2025-11-06 18:34:03,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 429.30 | bwd: 1104.38 | bwd_inner: 1.84 | bwd_allreduce: 1102.41 | step: 2.30
 57%|█████▋ | 2013/3507 [49:17<36:15, 1.46s/it] {'loss': 0.5516, 'learning_rate': 8.104051940419251e-06, 'epoch': 0.57}
tensor([[-3.4062, -3.1094, -0.0996, 3.1562, -1.3203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:34:03,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.87 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.2812, -1.9766, 2.0781, -1.1094, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.0938, -3.7656, 0.2656, 1.8359, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.4375, -3.7031, 0.1787, 2.8594, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.4375, -4.4062, -0.5508, 3.6250, -1.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.1875, -2.7500, 0.8047, 1.6562, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.8125, -2.6875, 0.6406, 2.3125, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.9688, -1.8594, 2.2969, 2.0312, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:34:04,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:34:04,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.10 | bwd_microstep: 758.47 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 757.37 | step_microstep: 1.77
[2025-11-06 18:34:04,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.98 | bwd: 759.14 | bwd_inner: 1.59 | bwd_allreduce: 757.42 | step: 1.85
 57%|█████▋ | 2014/3507 [49:18<33:42, 1.35s/it] {'loss': 0.3001, 'learning_rate': 8.094983030616517e-06, 'epoch': 0.57}
tensor([[-3.8281, 0.6016, 4.0000, -1.8047, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.9531, -1.8281, 2.0625, 4.0938, -1.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:34:04,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.94 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.8750, -0.3438, 3.8438, -1.7734, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.1875, -3.2344, 0.9180, 0.8398, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.0625, -0.1680, 3.1719, -1.7578, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.4688, -1.2031, 4.0312, -0.6602, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.8125, -2.8125, 0.6055, -1.5703, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.5312, -4.7188, -0.4277, 2.7031, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:34:05,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:34:05,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.41 | bwd_microstep: 238.59 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 237.66 | step_microstep: 2.48
[2025-11-06 18:34:05,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.38 | bwd: 239.34 | bwd_inner: 1.46 | bwd_allreduce: 237.71 | step: 2.57
 57%|█████▋ | 2015/3507 [49:19<28:11, 1.13s/it] {'loss': 0.1375, 'learning_rate': 8.085915746307515e-06, 'epoch': 0.57}
tensor([[-3.7969, -0.3438, 2.5000, -1.5000, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.8438, -5.5625, -1.2578, 2.7344, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.2188, -2.9688, 0.7852, -1.8359, -5.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:34:05,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.91 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-7.2500, -6.0000, -1.0391, 1.3906, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-7.2500, -5.3750, -0.6133, 0.3008, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.4062, -1.2969, 2.9375, -1.6250, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.6875, -0.4844, 1.2344, -2.4062, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
tensor([[-5.3438, -3.1094, 1.7188, 1.8594, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:34:08,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.49 | optimizer_step: 0.40
[2025-11-06 18:34:08,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.39 | bwd_microstep: 2427.97 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 2426.99 | step_microstep: 3.20
[2025-11-06 18:34:08,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.33 | bwd: 2428.63 | bwd_inner: 1.45 | bwd_allreduce: 2427.04 | step: 3.28
 57%|█████▋ | 2016/3507 [49:22<40:49, 1.64s/it] {'loss': 0.3996, 'learning_rate': 8.07685009522909e-06, 'epoch': 0.57}
tensor([[-4.4375, -2.7656, 2.2031, 3.5781, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:34:08,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.03 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.5625, -4.2188, -0.5391, 2.9688, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.4688, -3.9531, 0.6289, 2.2656, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.5000, -4.3750, 0.5664, 0.9805, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-2.9219, -1.4297, 2.4531, 3.5938, -1.4766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.0312, -2.5000, 0.5273, 2.7031, -1.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.4062, -1.0547, 1.1719, 1.7891, -1.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.7812, -2.0469, 2.2656, 0.7344, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:34:09,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:34:09,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.22 | bwd_microstep: 712.24 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 711.23 | step_microstep: 1.60
[2025-11-06 18:34:09,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.27 | bwd: 713.23 | bwd_inner: 1.83 | bwd_allreduce: 711.27 | step: 1.69
 58%|█████▊ | 2017/3507 [49:23<36:35, 1.47s/it] {'loss': 0.7781, 'learning_rate': 8.067786085116682e-06, 'epoch': 0.58}
tensor([[-3.9688, -2.8438, 0.6719, 2.2188, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.4375, -2.2031, 1.6484, 0.9141, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:34:09,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.77 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.2500, -4.0000, -0.2734, 3.2500, -1.8828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-7.1562, -6.0312, -0.9062, 2.0156, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.2500, 0.4922, 1.0391, -1.7734, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
tensor([[-1.5938, 2.0469, 3.4844, -1.2812, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.7500, -1.1406, 2.9062, -0.7656, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.3125, -0.1396, 2.8125, -0.0483, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:34:11,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:34:11,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.64 | bwd_microstep: 1767.14 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 1766.27 | step_microstep: 1.99
[2025-11-06 18:34:11,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.44 | bwd: 1767.84 | bwd_inner: 1.41 | bwd_allreduce: 1766.30 | step: 2.05
 58%|█████▊ | 2018/3507 [49:25<41:36, 1.68s/it] {'loss': 0.5577, 'learning_rate': 8.058723723704343e-06, 'epoch': 0.58}
tensor([[-6.7500, -6.6562, -2.7031, 1.8047, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.4688, -3.0312, 1.0625, 2.3594, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.7188, -3.1562, 0.5273, 3.2969, -1.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:34:11,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.70 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-6.1875, -1.7734, 2.4062, -2.7812, -5.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.1562, -3.7344, 1.0391, 2.6250, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.0000, -2.2500, 1.6953, -0.1406, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.0000, -3.1406, 2.0000, 0.5625, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.2500, -5.1562, -1.3828, 2.7656, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:34:12,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 18:34:12,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 104.57 | bwd_microstep: 663.92 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 662.68 | step_microstep: 1.53
[2025-11-06 18:34:12,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 309.29 | bwd: 664.84 | bwd_inner: 1.99 | bwd_allreduce: 662.72 | step: 1.61
 58%|█████▊ | 2019/3507 [49:26<36:35, 1.48s/it] {'loss': 0.659, 'learning_rate': 8.049663018724714e-06, 'epoch': 0.58}
tensor([[ 0.5586, 3.1250, 3.8750, 0.8086, -0.1289]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
[2025-11-06 18:34:12,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.72 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-3.9531, -4.5938, -3.2031, 0.9102, -1.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:1')
tensor([[-4.2188, -4.3438, -1.0312, 3.3438, -1.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:2')
tensor([[-2.0156, -1.1094, 1.2266, 2.2188, -0.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.5469, 0.5977, 3.0781, -2.1562, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.6562, -1.3438, 3.3281, 0.5352, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.8750, -4.8125, -0.6211, 1.7969, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.7188, -1.7578, 1.3828, 1.1328, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:34:12,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.29 | optimizer_step: 0.29
[2025-11-06 18:34:12,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.07 | bwd_microstep: 31.26 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 29.92 | step_microstep: 2.51
[2025-11-06 18:34:12,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 370.82 | bwd: 32.24 | bwd_inner: 2.08 | bwd_allreduce: 29.97 | step: 2.61
 58%|█████▊ | 2020/3507 [49:26<28:53, 1.17s/it] {'loss': 1.2468, 'learning_rate': 8.040603977909021e-06, 'epoch': 0.58}
tensor([[-5.5000, -5.4375, -2.2344, 1.6406, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:34:13,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.44 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.7812, -3.5312, 1.2188, 3.6094, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.9844, 1.1094, 2.6719, -2.8594, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.1562, -0.8164, 2.9844, -0.1309, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.1875, -4.1250, 0.0718, 2.0469, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.3750, -3.0156, 1.0938, -1.6719, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.1250, -4.5938, -0.8906, 2.1250, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.6250, -5.6562, -2.6250, 1.2031, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:34:14,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:34:14,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.90 | bwd_microstep: 1556.87 | bwd_inner_microstep: 1.55 | bwd_allreduce_microstep: 1555.20 | step_microstep: 1.66
[2025-11-06 18:34:14,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 295.35 | bwd: 1557.67 | bwd_inner: 2.23 | bwd_allreduce: 1555.25 | step: 1.75
 58%|█████▊ | 2021/3507 [49:28<34:14, 1.38s/it] {'loss': 0.0874, 'learning_rate': 8.031546608987072e-06, 'epoch': 0.58}
tensor([[-2.7344, 0.7578, 1.9922, -2.3906, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
[2025-11-06 18:34:14,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 78.73 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.8750, -1.3594, 1.0078, -2.5625, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-6.0625, -4.3125, 0.3457, 1.5156, -3.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.0312, -4.7188, -1.0938, 2.4688, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-7.8125, -4.8750, 1.5547, 0.8438, -5.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.7812, -1.5391, 2.6719, 0.1196, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.0000, -4.8750, -1.3750, 2.2188, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.1406, -1.9141, 0.6133, 1.3047, -1.9453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:34:15,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.16 | optimizer_step: 0.25
[2025-11-06 18:34:15,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.36 | bwd_microstep: 816.37 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 815.17 | step_microstep: 1.82
[2025-11-06 18:34:15,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 260.11 | bwd: 817.30 | bwd_inner: 1.96 | bwd_allreduce: 815.21 | step: 1.90
 58%|█████▊ | 2022/3507 [49:29<32:10, 1.30s/it] {'loss': 1.0694, 'learning_rate': 8.02249091968725e-06, 'epoch': 0.58}
tensor([[-3.0781, -3.4219, -1.9453, 1.6094, -0.9961]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.0625, -2.0938, 1.2500, 3.3750, -1.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.8281, -2.1250, 1.5625, 2.3281, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:34:16,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.20 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.7188, 0.4902, 4.2500, -0.6953, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.8750, -2.3281, 1.9531, 0.8867, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.8438, -3.1094, 1.1250, 1.9609, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.1562, -2.5156, 0.8867, 3.4844, -1.2891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.1875, 0.3887, 2.9531, -1.0469, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:34:17,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.18 | optimizer_step: 0.20
[2025-11-06 18:34:17,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.03 | bwd_microstep: 716.63 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 715.42 | step_microstep: 2.44
[2025-11-06 18:34:17,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.24 | bwd: 717.49 | bwd_inner: 1.90 | bwd_allreduce: 715.47 | step: 2.52
 58%|█████▊ | 2023/3507 [49:30<30:55, 1.25s/it] {'loss': 0.3233, 'learning_rate': 8.013436917736495e-06, 'epoch': 0.58}
tensor([[-4.1250, -4.8750, -3.1250, 1.4609, -1.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:0')
[2025-11-06 18:34:17,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.86 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.2812, -3.8125, -0.0869, 1.0703, -3.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.7812, -2.9688, 0.4316, 4.8438, -0.4258]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.3438, 1.5312, 3.3906, -2.0000, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.1562, -2.7188, 1.0312, 0.1416, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.5312, -3.4062, 0.3418, 4.3438, -1.1484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.0000, -2.7969, 1.0312, 2.7656, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.4688, -3.6875, -0.8945, 3.2188, -1.1328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:34:19,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.21
[2025-11-06 18:34:19,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.46 | bwd_microstep: 1916.94 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1915.67 | step_microstep: 1.96
[2025-11-06 18:34:19,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.33 | bwd: 1917.67 | bwd_inner: 1.81 | bwd_allreduce: 1915.72 | step: 2.04
 58%|█████▊ | 2024/3507 [49:33<38:30, 1.56s/it] {'loss': 0.6258, 'learning_rate': 8.004384610860324e-06, 'epoch': 0.58}
tensor([[-9.9375, -8.2500, -4.9062, -3.9531, -7.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.7500, -5.1562, -1.3750, 3.9219, -1.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.6250, -1.2969, 2.7188, -0.1523, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:34:19,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.03 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.9062, -1.7344, 3.3281, -1.2656, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.0938, -3.5312, 0.6758, 1.9609, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-6.1250, -4.9688, -0.0325, 2.5312, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.0000, -3.7344, 0.2139, 4.1562, -1.5078]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.0938, -5.0625, -0.5938, 1.9219, -3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:34:20,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:34:20,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.23 | bwd_microstep: 366.23 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 365.07 | step_microstep: 1.95
[2025-11-06 18:34:20,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.28 | bwd: 367.12 | bwd_inner: 1.87 | bwd_allreduce: 365.11 | step: 2.03
 58%|█████▊ | 2025/3507 [49:33<32:55, 1.33s/it] {'loss': 0.3916, 'learning_rate': 7.995334006782793e-06, 'epoch': 0.58}
tensor([[-4.3125, -1.8828, 1.8359, 0.5508, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-1.5703, 1.9062, 2.2969, -2.2969, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:34:20,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.52 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-6.6250, -5.1562, -0.2598, 1.7266, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[1.4141, 3.8750, 5.7812, 3.7188, 0.9727]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.2500, -2.9219, 1.4531, -1.2500, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.6875, -0.9961, 2.6719, -0.6250, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.6562, -1.8828, 1.3359, -2.7500, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.7031, 0.2656, 3.2344, 0.2676, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:34:20,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:34:20,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.58 | bwd_microstep: 113.57 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 112.29 | step_microstep: 2.21
[2025-11-06 18:34:20,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 291.11 | bwd: 114.43 | bwd_inner: 1.96 | bwd_allreduce: 112.33 | step: 2.30
 58%|█████▊ | 2026/3507 [49:34<26:19, 1.07s/it] {'loss': 0.1876, 'learning_rate': 7.98628511322651e-06, 'epoch': 0.58}
tensor([[-3.8906, -1.6406, 1.5859, 0.6797, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-4.0625, -3.5781, 0.4258, 3.8438, -1.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:34:20,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.36 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.0625, -4.0625, 1.2266, 1.9844, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-2.8750, -3.2812, -0.3164, 4.0938, -0.5234]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.0938, -3.6250, 0.5117, 1.9531, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.7188, -2.7969, 1.2266, 1.5391, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-8.8125, -7.6250, -1.6953, 1.4297, -5.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.5000, -0.0674, 2.9062, -0.7891, -3.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:34:23,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:34:23,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.81 | bwd_microstep: 2676.94 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 2675.69 | step_microstep: 2.37
[2025-11-06 18:34:23,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.19 | bwd: 2677.88 | bwd_inner: 2.01 | bwd_allreduce: 2675.74 | step: 2.45
 58%|█████▊ | 2027/3507 [49:37<40:55, 1.66s/it] {'loss': 1.2154, 'learning_rate': 7.97723793791263e-06, 'epoch': 0.58}
tensor([[-4.6562, -1.3125, 1.5703, -1.7578, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-4.8750, -4.4375, -0.0630, 3.4531, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-1.9688, -1.9688, 1.1328, 5.1250, 0.0708]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:34:23,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.76 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.3125, -0.1885, 3.8750, -0.9805, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.4375, -2.3750, 0.9141, -1.4609, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-2.6562, 1.4844, 3.8438, -1.7344, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
tensor([[-5.2500, -2.5156, 1.1875, -0.7461, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.4688, -2.6406, 1.0156, 1.0703, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:34:24,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:34:24,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.87 | bwd_microstep: 70.72 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 69.66 | step_microstep: 1.51
[2025-11-06 18:34:24,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.67 | bwd: 71.65 | bwd_inner: 1.82 | bwd_allreduce: 69.70 | step: 1.60
 58%|█████▊ | 2028/3507 [49:37<31:55, 1.30s/it] {'loss': 0.8171, 'learning_rate': 7.968192488560829e-06, 'epoch': 0.58}
tensor([[-3.2812, 0.2578, 3.3594, -0.4805, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.7500, -2.8438, 2.0469, 0.6875, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:34:24,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.67 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.4062, -4.6875, 0.0272, 1.1250, -4.3125]], device='cuda:3', dtype=torch.bfloat16,
grad_fn=) tensor([3], device='cuda:3') tensor([[-6.8438, -4.5312, 1.0078, 1.3125, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0625, -3.7969, -0.8008, 2.3438, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -3.3750, 0.9141, 1.6406, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7812, -1.3984, 1.4297, 0.6016, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5938, -3.2656, -2.2969, 1.7188, -0.4492]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') [2025-11-06 18:34:26,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.24 | optimizer_step: 0.37 [2025-11-06 18:34:26,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.28 | bwd_microstep: 2254.26 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 2253.14 | step_microstep: 2.84 [2025-11-06 18:34:26,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 565.00 | bwd: 2255.20 | bwd_inner: 1.82 | bwd_allreduce: 2253.20 | step: 2.93 58%|█████▊ | 2029/3507 [49:40<43:32, 1.77s/it] {'loss': 0.8539, 'learning_rate': 7.95914877288932e-06, 'epoch': 0.58} 58%|█████▊ | 2029/3507 [49:40<43:32, 1.77s/it]tensor([[-5.0938, -5.0625, -1.4219, 2.8438, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.7188, -5.4688, -0.6797, 1.5078, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:27,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.84 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0312, -2.6562, 0.8867, 0.3184, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-5.6875, -3.1562, 1.6953, 1.0469, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1875, -3.4062, -0.9883, 2.7344, -1.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.0442, 0.0718, 2.8281, 6.2188, 1.5234]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -4.2500, -0.6328, 2.2031, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.4688, -2.9688, 2.6562, 0.2305, -5.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:27,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:34:27,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.25 | bwd_microstep: 1.42 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.65 | step_microstep: 2.21 [2025-11-06 18:34:27,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 448.10 | bwd: 2.11 | bwd_inner: 1.31 | bwd_allreduce: 0.68 | step: 2.29 58%|█████▊ | 2030/3507 [49:41<34:15, 1.39s/it] {'loss': 0.242, 'learning_rate': 7.950106798614831e-06, 'epoch': 0.58} 58%|█████▊ | 2030/3507 [49:41<34:15, 1.39s/it]tensor([[-6.8125, -4.0312, 1.0547, 0.1050, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4688, -6.4375, -2.6719, 1.8281, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-8.5000, -6.7188, -1.5000, -0.0488, -5.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:27,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.14 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.2812, -5.1562, 
-1.3047, 2.9688, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2500, -3.0000, 0.9531, 0.8633, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2188, -4.7188, -0.8086, 2.4531, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.7500, -4.0000, 1.8984, 1.2734, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.7188, -3.2969, 1.5938, -1.0391, -5.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:34:29,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.02 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:34:29,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.65 | bwd_microstep: 1794.70 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 1793.36 | step_microstep: 3.19 [2025-11-06 18:34:29,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.80 | bwd: 1795.36 | bwd_inner: 1.82 | bwd_allreduce: 1793.41 | step: 3.28 58%|█████▊ | 2031/3507 [49:43<39:41, 1.61s/it] {'loss': 0.7691, 'learning_rate': 7.941066573452613e-06, 'epoch': 0.58} 58%|█████▊ | 2031/3507 [49:43<39:41, 1.61s/it]tensor([[-4.7188, -1.5000, 1.2422, -1.8359, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -4.3438, -0.6055, 2.5938, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:29,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.65 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.1250, -4.6875, 0.1992, 1.9844, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8438, -4.0000, 0.6133, 1.4219, -3.9688]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8438, -4.1875, 0.5000, 1.7656, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4062, -2.4531, 0.6719, 0.2637, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.5312, -3.1250, 1.2109, 0.6680, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.1250, -3.4844, 2.4844, -0.0596, -5.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:34:30,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:34:30,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.51 | bwd_microstep: 110.58 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 109.43 | step_microstep: 1.96 [2025-11-06 18:34:30,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.19 | bwd: 111.39 | bwd_inner: 1.76 | bwd_allreduce: 109.47 | step: 2.03 58%|█████▊ | 2032/3507 [49:43<31:30, 1.28s/it] {'loss': 0.3239, 'learning_rate': 7.932028105116412e-06, 'epoch': 0.58} 58%|█████▊ | 2032/3507 [49:43<31:30, 1.28s/it]tensor([[-5.7812, -5.1250, -1.0469, 2.0156, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:30,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.64 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.6250, -3.6406, 0.8125, 1.0625, -3.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-1.0625, 2.2969, 2.5938, -1.7969, -1.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)tensor([3], device='cuda:1') tensor([2], device='cuda:3') tensor([[-8.3750, -4.3750, 0.7773, -2.6406, -7.1875]], device='cuda:2', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1562, -1.3438, 1.8984, -0.2383, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.7812, -6.5938, -1.7266, 0.9609, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8750, -0.8633, 3.5312, -0.9414, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, -0.8164, 1.2812, -3.1406, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:34:30,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:34:30,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.07 | bwd_microstep: 486.95 | bwd_inner_microstep: 1.35 | bwd_allreduce_microstep: 485.50 | step_microstep: 1.86 [2025-11-06 18:34:30,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.75 | bwd: 487.87 | bwd_inner: 2.17 | bwd_allreduce: 485.55 | step: 1.95 58%|█████▊ | 2033/3507 [49:44<28:38, 1.17s/it] {'loss': 0.4705, 'learning_rate': 7.922991401318487e-06, 'epoch': 0.58} 58%|█████▊ | 2033/3507 [49:44<28:38, 1.17s/it]tensor([[-2.1250, 1.0703, 3.4688, 0.2793, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5938, -1.2188, 2.9062, 0.1709, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1250, -3.4219, 1.0859, 2.2031, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0312, -2.8906, -1.8359, 2.3750, 0.0679]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-2.7344, -0.6289, 2.7656, 2.4062, -1.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:31,448] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 190.45 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.6719, -0.9883, 0.9570, -1.3203, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.8125, -2.7969, 0.4785, 0.1357, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7188, -3.6719, 1.4453, 2.0000, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:34:32,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.24 | optimizer_step: 0.21 [2025-11-06 18:34:32,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.72 | bwd_microstep: 864.36 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 863.26 | step_microstep: 2.16 [2025-11-06 18:34:32,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 408.21 | bwd: 865.34 | bwd_inner: 1.88 | bwd_allreduce: 863.31 | step: 2.26 58%|█████▊ | 2034/3507 [49:46<31:49, 1.30s/it] {'loss': 0.735, 'learning_rate': 7.913956469769582e-06, 'epoch': 0.58} 58%|█████▊ | 2034/3507 [49:46<31:49, 1.30s/it]tensor([[-4.3125, -1.6719, 2.5000, 1.1641, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1250, -0.7812, 1.2578, 0.1245, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5938, -3.1094, 2.2969, 1.8516, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:32,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.45 | bwd_microstep: 1.19 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-2.5156, 1.8984, 4.2188, -2.1250, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -3.3281, 0.8320, 
3.5781, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7188, -4.4375, -0.4941, 3.0469, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7031, -2.4844, -0.3867, 2.7188, -0.8086]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6562, -2.4531, 1.6016, 1.5859, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:34:33,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:34:33,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.76 | bwd_microstep: 426.72 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 425.64 | step_microstep: 1.69 [2025-11-06 18:34:33,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.23 | bwd: 427.91 | bwd_inner: 2.05 | bwd_allreduce: 425.70 | step: 1.80 58%|█████▊ | 2035/3507 [49:47<28:06, 1.15s/it] {'loss': 0.3213, 'learning_rate': 7.904923318178934e-06, 'epoch': 0.58} 58%|█████▊ | 2035/3507 [49:47<28:06, 1.15s/it]tensor([[-2.9844, -0.8281, 2.2188, 1.5469, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9375, -0.7188, 3.2344, -1.8281, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1250, -0.2178, 4.0625, -0.1074, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6562, -2.9688, 1.0156, 1.8516, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7500, -2.2969, 2.8594, 0.2080, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3750, -4.6562, -2.0781, 1.9609, -1.8672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-4.6875, -1.9219, 2.2500, 0.4531, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:34,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.05 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-6.0312, -4.4688, -0.1201, 1.2969, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:35,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:34:35,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.30 | bwd_microstep: 2.26 | bwd_inner_microstep: 1.30 | bwd_allreduce_microstep: 0.88 | step_microstep: 2.28 [2025-11-06 18:34:35,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.29 | bwd: 3.27 | bwd_inner: 2.22 | bwd_allreduce: 0.92 | step: 2.37 58%|█████▊ | 2036/3507 [49:48<31:57, 1.30s/it] {'loss': 0.4558, 'learning_rate': 7.895891954254258e-06, 'epoch': 0.58} 58%|█████▊ | 2036/3507 [49:48<31:57, 1.30s/it]tensor([[-3.7812, -2.8906, 0.7969, 3.1094, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:35,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.18 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.1094, 1.9141, 3.6250, -1.7344, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2500, -3.5625, -0.5117, 1.5625, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5312, -3.4531, 1.0000, 1.0234, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -3.7031, 0.3105, 2.7188, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') tensor([[-3.5156, -0.0045, 3.0312, -0.6562, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6875, -1.8359, 1.6562, -2.9062, -5.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.3438, -5.5625, -0.8711, 2.2500, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:34:37,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.26 | optimizer_step: 0.34 [2025-11-06 18:34:37,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 118.40 | bwd_microstep: 1689.26 | bwd_inner_microstep: 1.58 | bwd_allreduce_microstep: 1687.53 | step_microstep: 2.74 [2025-11-06 18:34:37,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 294.61 | bwd: 1690.12 | bwd_inner: 2.36 | bwd_allreduce: 1687.57 | step: 2.82 58%|█████▊ | 2037/3507 [49:50<37:12, 1.52s/it] {'loss': 0.1707, 'learning_rate': 7.886862385701748e-06, 'epoch': 0.58} 58%|█████▊ | 2037/3507 [49:50<37:12, 1.52s/it]tensor([[-5.8125, -4.5000, 0.7109, 3.1406, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0938, -0.0640, 3.8125, -1.1641, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -3.5000, -0.2578, 2.0938, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4844, 0.1914, 3.2656, -0.8750, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1562, -3.7969, 0.5586, 2.4531, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2812, -1.7031, 3.2188, 0.2891, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:37,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 286.92 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.6875, -4.2500, 0.4102, 2.4219, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4688, -3.0000, 0.3867, 1.2422, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:37,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:34:37,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.69 | bwd_microstep: 1.94 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.76 | step_microstep: 2.11 [2025-11-06 18:34:37,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 445.64 | bwd: 2.99 | bwd_inner: 2.06 | bwd_allreduce: 0.80 | step: 2.20 58%|█████▊ | 2038/3507 [49:51<31:46, 1.30s/it] {'loss': 0.4695, 'learning_rate': 7.87783462022606e-06, 'epoch': 0.58} 58%|█████▊ | 2038/3507 [49:51<31:46, 1.30s/it]tensor([[-0.7852, 1.6797, 4.6562, 2.7344, -0.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:37,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.49 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.5000, -3.9844, 0.4453, 2.0469, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.2812, -4.6250, 1.0312, 0.8789, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.5312, 0.8320, 2.5469, -1.3750, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.2656, 1.7344, 3.8438, -1.5469, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.0000, -2.0938, -1.8281, 2.1875, 0.8320]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.7188, -3.4062, 0.5117, 2.1406, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0000, -0.7148, 2.4219, 1.6875, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:34:40,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.20 | optimizer_step: 0.23 [2025-11-06 18:34:40,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.58 | bwd_microstep: 2006.58 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 2005.49 | step_microstep: 2.18 [2025-11-06 18:34:40,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 446.10 | bwd: 2007.56 | bwd_inner: 1.88 | bwd_allreduce: 2005.54 | step: 2.26 58%|█████▊ | 2039/3507 [49:54<40:32, 1.66s/it] {'loss': 0.4464, 'learning_rate': 7.868808665530323e-06, 'epoch': 0.58} 58%|█████▊ | 2039/3507 [49:54<40:32, 1.66s/it]tensor([[-3.5781, 0.0742, 3.7031, 0.1660, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0000, 0.6562, 4.1562, -2.2969, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7812, -3.1094, 2.1094, 1.4844, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:40,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.21 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.2500, -5.0312, -2.9531, 1.8906, -1.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4375, 0.0854, 3.0000, -1.1016, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3438, -1.8906, 2.8125, -0.2412, -4.6562]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1719, -0.8516, 1.7500, 0.3770, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8125, -5.0312, -0.6758, 2.3750, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:40,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:34:40,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.42 | bwd_microstep: 1.64 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.64 | step_microstep: 1.57 [2025-11-06 18:34:40,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 388.68 | bwd: 2.50 | bwd_inner: 1.72 | bwd_allreduce: 0.66 | step: 1.64 58%|█████▊ | 2040/3507 [49:54<31:29, 1.29s/it] {'loss': 0.1301, 'learning_rate': 7.859784529316103e-06, 'epoch': 0.58} 58%|█████▊ | 2040/3507 [49:54<31:29, 1.29s/it]tensor([[-3.5312, -2.7656, -0.0742, 1.9453, -1.8359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9844, -2.7812, 0.2139, 1.7812, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1250, -4.1250, -0.1572, 2.0000, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2812, -3.9219, 1.6094, 1.5547, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:41,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.43 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.7500, 0.2969, 2.3594, -0.8906, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.1875, -6.5312, -3.4062, 1.2656, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:3') tensor([[-7.1250, -3.3125, 1.7500, -1.3281, -6.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0938, -2.6250, 1.6016, 1.0078, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:34:41,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:34:41,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.79 | bwd_microstep: 505.70 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 504.39 | step_microstep: 1.71 [2025-11-06 18:34:41,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 459.25 | bwd: 506.61 | bwd_inner: 2.03 | bwd_allreduce: 504.43 | step: 1.80 58%|█████▊ | 2041/3507 [49:55<29:24, 1.20s/it] {'loss': 0.3121, 'learning_rate': 7.85076221928343e-06, 'epoch': 0.58} 58%|█████▊ | 2041/3507 [49:55<29:24, 1.20s/it]tensor([[-4.8750, -4.8750, -1.2344, 3.1719, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -2.8906, 1.1484, 1.6875, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.0000, -5.1562, -0.4219, 0.5312, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -1.0781, 3.4062, -0.9023, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9531, -0.8594, 1.9453, -0.7695, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.3223, 3.0156, 2.7500, -1.5000, -1.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:34:43,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.06 | bwd_microstep: 1.32 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.5938, 
-1.1875, 2.7812, -0.4277, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.7969, 0.0898, 2.8906, -1.9141, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:34:43,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 18:34:43,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.59 | bwd_microstep: 293.23 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 292.04 | step_microstep: 2.00 [2025-11-06 18:34:43,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.66 | bwd: 294.55 | bwd_inner: 2.32 | bwd_allreduce: 292.08 | step: 2.07 58%|█████▊ | 2042/3507 [49:57<34:00, 1.39s/it] {'loss': 0.2796, 'learning_rate': 7.841741743130765e-06, 'epoch': 0.58} 58%|█████▊ | 2042/3507 [49:57<34:00, 1.39s/it]tensor([[ 1.0547, 4.6875, 6.2812, 0.9609, -0.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:43,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.76 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.5625, -3.2031, 0.9727, 0.4375, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1875, -0.2393, 2.3750, -0.4785, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.5000, -2.5312, 1.2656, 1.5781, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -3.7500, 0.5742, 2.8125, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.7188, -2.8750, 2.8125, -0.4375, -5.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.5000, -2.2812, 2.8906, 1.1016, -4.3438]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5156, -0.5234, 2.8594, 0.8516, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:34:44,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:34:44,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.82 | bwd_microstep: 267.92 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 266.61 | step_microstep: 1.63 [2025-11-06 18:34:44,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.55 | bwd: 268.81 | bwd_inner: 2.02 | bwd_allreduce: 266.65 | step: 1.71 58%|█████▊ | 2043/3507 [49:58<28:55, 1.19s/it] {'loss': 0.7807, 'learning_rate': 7.832723108555016e-06, 'epoch': 0.58} 58%|█████▊ | 2043/3507 [49:58<28:55, 1.19s/it]tensor([[-4.1250, -1.0547, 2.8906, 0.3457, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2656, -2.7031, -1.2109, 2.6094, -0.2217]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.9062, -3.2188, -0.4707, -0.5547, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.6875, -4.9688, 0.6992, 0.2402, -5.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -1.1094, 1.9062, -0.0786, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.8125, -4.4688, 1.0312, 1.2031, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:45,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.99 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-7.6875, -6.6875, -2.4062, 0.2197, -4.8750]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.0625, 0.9727, 2.6562, -1.1406, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:34:46,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.26 | optimizer_step: 0.26 [2025-11-06 18:34:46,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.24 | bwd_microstep: 708.02 | bwd_inner_microstep: 5.35 | bwd_allreduce_microstep: 702.54 | step_microstep: 2.84 [2025-11-06 18:34:46,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 305.23 | bwd: 709.05 | bwd_inner: 6.25 | bwd_allreduce: 702.60 | step: 2.97 58%|█████▊ | 2044/3507 [50:00<34:46, 1.43s/it] {'loss': 0.9984, 'learning_rate': 7.823706323251512e-06, 'epoch': 0.58} 58%|█████▊ | 2044/3507 [50:00<34:46, 1.43s/it]tensor([[-5.2812, -4.8438, -1.1250, 2.0938, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4375, -4.8750, -0.1162, 3.5312, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:46,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.19 | bwd_microstep: 3.55 | bwd_inner_microstep: 3.42 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.3594, 1.2500, 2.8594, -1.2656, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0312, 1.3750, 4.5312, -1.5938, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.6250, 1.7734, 3.9531, -0.1865, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.6875, -6.1250, -1.5312, 2.2969, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6875, -5.0312, -0.9023, 2.3125, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:3') tensor([[-2.6719, 0.7109, 2.7969, -0.8398, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') [2025-11-06 18:34:47,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:34:47,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.90 | bwd_microstep: 720.48 | bwd_inner_microstep: 3.11 | bwd_allreduce_microstep: 717.26 | step_microstep: 1.82 [2025-11-06 18:34:47,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 267.11 | bwd: 724.02 | bwd_inner: 6.55 | bwd_allreduce: 717.31 | step: 1.91 58%|█████▊ | 2045/3507 [50:01<31:53, 1.31s/it] {'loss': 0.3442, 'learning_rate': 7.814691394914001e-06, 'epoch': 0.58} 58%|█████▊ | 2045/3507 [50:01<31:53, 1.31s/it]tensor([[-1.2656, 2.5156, 2.8906, -2.7188, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4688, -3.4219, 0.8398, 0.7422, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.4375, -4.9688, 0.9492, 1.0781, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2500, -3.0938, 1.6953, -0.0566, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.3438, -3.3438, 2.1719, 0.7734, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3125, 0.0122, 3.6406, -2.0000, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:48,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.74 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-1.9141, -2.6719, -1.9531, 1.8281, 0.0101]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') 
tensor([[-6.1562, -3.6875, 0.3184, -0.6602, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:34:49,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.14 | optimizer_step: 0.19 [2025-11-06 18:34:49,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.05 | bwd_microstep: 68.61 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 67.42 | step_microstep: 1.94 [2025-11-06 18:34:49,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.79 | bwd: 69.56 | bwd_inner: 1.94 | bwd_allreduce: 67.46 | step: 2.05 58%|█████▊ | 2046/3507 [50:03<36:34, 1.50s/it] {'loss': 0.3651, 'learning_rate': 7.805678331234647e-06, 'epoch': 0.58} 58%|█████▊ | 2046/3507 [50:03<36:34, 1.50s/it]tensor([[-1.8906, 2.5469, 3.7656, -2.6094, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:34:49,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 76.90 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.2500, -3.9688, -1.7969, 3.0000, -0.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2812, -3.9062, -0.5898, 0.7578, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3594, -4.0000, -1.9844, 2.4062, -0.9414]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.1562, -3.0312, 1.8125, -0.3770, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.7656, 1.9688, 3.1719, -1.8047, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4062, -5.2188, -1.4062, 2.8281, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9453, 1.5859, 3.9688, 
-0.3984, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:34:49,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.75 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:34:49,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.77 | bwd_microstep: 112.85 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 111.91 | step_microstep: 2.36 [2025-11-06 18:34:49,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 240.69 | bwd: 113.64 | bwd_inner: 1.56 | bwd_allreduce: 111.94 | step: 2.43 58%|█████▊ | 2047/3507 [50:03<28:22, 1.17s/it] {'loss': 0.2922, 'learning_rate': 7.796667139904036e-06, 'epoch': 0.58} 58%|█████▊ | 2047/3507 [50:03<28:22, 1.17s/it]tensor([[-4.6562, -2.4375, 1.1328, 0.5078, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5000, -4.6875, -0.5703, 2.1719, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -5.0938, -1.3047, 2.6562, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -4.1250, -0.0114, 3.2031, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0625, -3.4219, 0.1670, 0.4258, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2812, -1.6328, 2.7656, -0.7383, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4062, -1.5938, 2.8750, 1.3594, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:52,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.81 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9062, -3.0469, 0.5547, 0.8555, -3.3750]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:52,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:34:52,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.99 | bwd_microstep: 1.93 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.78 | step_microstep: 1.81 [2025-11-06 18:34:52,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.80 | bwd: 2.91 | bwd_inner: 1.98 | bwd_allreduce: 0.81 | step: 1.89 58%|█████▊ | 2048/3507 [50:06<40:31, 1.67s/it] {'loss': 0.5894, 'learning_rate': 7.78765782861114e-06, 'epoch': 0.58} 58%|█████▊ | 2048/3507 [50:06<40:31, 1.67s/it]tensor([[-1.9297, 0.2891, 2.6250, 0.9062, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0625, -3.0156, 0.2139, 4.0625, -0.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4062, -1.6797, 1.5625, -0.1953, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:52,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.50 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.2188e+00, -5.2812e+00, -7.1716e-04, 3.2812e+00, -3.4062e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -1.0938, 1.9062, -2.2188, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0938, -2.1406, 1.9375, -0.0145, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.1641, -2.1875, -1.8281, 2.2031, 0.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5938, -1.6016, 2.5469, 0.5117, -3.7656]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:34:52,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:34:52,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.98 | bwd_microstep: 37.75 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 36.93 | step_microstep: 2.19 [2025-11-06 18:34:52,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.51 | bwd: 38.57 | bwd_inner: 1.48 | bwd_allreduce: 36.96 | step: 2.26 58%|█████▊ | 2049/3507 [50:06<31:27, 1.29s/it] {'loss': 0.142, 'learning_rate': 7.778650405043336e-06, 'epoch': 0.58} 58%|█████▊ | 2049/3507 [50:06<31:27, 1.29s/it]tensor([[-5.0312, -1.7266, 2.4375, -0.0559, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1250, -4.4062, 0.3379, 1.6250, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -4.4688, -0.4844, 2.7969, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -4.7812, -1.7188, 2.3594, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6875, -1.6250, 2.4531, 0.5469, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.6875, -6.2188, -0.5273, 0.0304, -6.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:54,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.69 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.7969, -1.4453, 0.2773, 2.4219, -0.4414]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.0625, -6.1250, -2.7031, 1.5547, -3.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
[2025-11-06 18:34:55,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:34:55,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.54 | bwd_microstep: 167.55 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 166.58 | step_microstep: 2.35 [2025-11-06 18:34:55,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.25 | bwd: 168.25 | bwd_inner: 1.50 | bwd_allreduce: 166.62 | step: 2.43 58%|█████▊ | 2050/3507 [50:09<39:32, 1.63s/it] {'loss': 0.155, 'learning_rate': 7.769644876886393e-06, 'epoch': 0.58} 58%|█████▊ | 2050/3507 [50:09<39:32, 1.63s/it]tensor([[-2.2031, -2.6406, -0.8711, 3.0625, -0.1738]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -5.1250, -1.8594, 2.8125, -1.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0312, -3.1562, 0.4668, 0.7695, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:55,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.11 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-6.0938, -4.5625, 0.1113, 1.5000, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.2500, -4.5938, 0.2080, 1.5469, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -1.7422, 1.8750, 0.1123, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.5938, -2.7812, 1.8984, 0.7930, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0312, -1.9453, 2.6562, -1.6641, -5.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:34:55,841] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:34:55,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.42 | bwd_microstep: 146.68 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 145.56 | step_microstep: 1.44 [2025-11-06 18:34:55,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.56 | bwd: 147.58 | bwd_inner: 1.79 | bwd_allreduce: 145.61 | step: 1.54 58%|█████▊ | 2051/3507 [50:09<31:31, 1.30s/it] {'loss': 0.3711, 'learning_rate': 7.760641251824447e-06, 'epoch': 0.58} 58%|█████▊ | 2051/3507 [50:09<31:31, 1.30s/it]tensor([[-1.8125, 1.8672, 3.0625, -1.8125, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -0.8906, 2.5156, -0.7617, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:56,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.19 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.8125, -5.0938, 0.5312, 0.2285, -5.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5312, -2.4219, 1.5859, -0.4023, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.3906, -4.3125, -2.9844, 1.6172, -0.9570]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2812, -0.7539, 3.5938, -2.0000, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3125, -4.1875, 0.0295, 2.1719, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5625, -1.0391, 2.4375, -1.2891, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 18:34:57,475] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:34:57,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.64 | bwd_microstep: 1027.77 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1026.64 | step_microstep: 2.04 [2025-11-06 18:34:57,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.85 | bwd: 1028.62 | bwd_inner: 1.78 | bwd_allreduce: 1026.68 | step: 2.13 59%|█████▊ | 2052/3507 [50:11<33:56, 1.40s/it] {'loss': 0.6391, 'learning_rate': 7.751639537540024e-06, 'epoch': 0.59} 59%|█████▊ | 2052/3507 [50:11<33:56, 1.40s/it]tensor([[-1.2656, 1.9844, 2.4219, -1.3281, -1.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.2656, 0.5117, 2.4688, -2.3906, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([[-3.8438, 0.4707, 3.8594, -1.8516, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([2], device='cuda:1') tensor([[-2.7188, 1.0469, 2.7031, -2.0781, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:57,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.45 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.4219, -0.9492, 1.2500, 3.5625, -0.0267]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3594, 0.3164, 2.0625, -2.3438, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.6875, -3.7500, 2.2188, 1.3516, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0781, 0.0791, 2.8125, -0.5078, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:34:57,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 
| optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:34:57,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.02 | bwd_microstep: 1.41 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.64 | step_microstep: 2.02 [2025-11-06 18:34:57,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.46 | bwd: 2.24 | bwd_inner: 1.43 | bwd_allreduce: 0.68 | step: 2.10 59%|█████▊ | 2053/3507 [50:11<26:55, 1.11s/it] {'loss': 0.2635, 'learning_rate': 7.74263974171402e-06, 'epoch': 0.59} 59%|█████▊ | 2053/3507 [50:11<26:55, 1.11s/it]tensor([[-5.7812, -2.6875, 2.4375, 1.0234, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8438, -3.5312, -0.3516, 2.9375, -1.6641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.0938, -4.4688, 0.8672, 0.6367, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0625, -3.4688, 1.6641, 1.1797, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:34:58,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 334.92 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.0000, -2.4375, 2.8906, -0.1143, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.0000, -4.4688, 1.3516, 1.2344, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1562, -3.8594, 0.5117, 0.1768, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2500, -3.5625, -0.4453, 1.8047, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:35:00,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.18 | optimizer_step: 
0.21 [2025-11-06 18:35:00,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.76 | bwd_microstep: 1115.62 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1114.56 | step_microstep: 1.85 [2025-11-06 18:35:00,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 460.70 | bwd: 1116.47 | bwd_inner: 1.71 | bwd_allreduce: 1114.61 | step: 1.93 59%|█████▊ | 2054/3507 [50:14<37:52, 1.56s/it] {'loss': 0.6622, 'learning_rate': 7.733641872025688e-06, 'epoch': 0.59} 59%|█████▊ | 2054/3507 [50:14<37:52, 1.56s/it]tensor([[-4.0938, -2.5156, 0.3984, 0.8594, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8438, -3.4688, -0.3242, 2.7500, -1.7578]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6250, -4.7500, -0.1777, 0.5586, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:00,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.75 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.1719, 0.9141, 4.1562, -1.1094, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -1.3906, 2.2969, -1.0156, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4688, -5.5312, -1.7578, 2.8906, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0781, 1.4375, 4.1562, -2.0000, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2812, -1.8594, 2.0781, -0.8203, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:35:01,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.17 | optimizer_step: 0.29 [2025-11-06 18:35:01,007] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.50 | bwd_microstep: 70.79 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 69.57 | step_microstep: 1.95 [2025-11-06 18:35:01,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.27 | bwd: 71.68 | bwd_inner: 1.93 | bwd_allreduce: 69.61 | step: 2.03 59%|█████▊ | 2055/3507 [50:14<29:55, 1.24s/it] {'loss': 0.2139, 'learning_rate': 7.724645936152643e-06, 'epoch': 0.59} 59%|█████▊ | 2055/3507 [50:14<29:55, 1.24s/it]tensor([[-4.0625, -1.8438, 1.9141, 1.3359, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8438, -2.0781, 1.6719, 0.4902, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:01,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.35 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-2.3281, 1.4453, 3.2188, -1.5781, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.7500, -3.9062, 0.5586, 1.6797, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7656, -4.2188, -2.0625, 2.1094, -1.3359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -1.2969, 3.0781, 0.0160, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5469, -0.0535, 2.3594, -1.6094, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -2.2188, 1.0859, -0.2139, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:35:03,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.24 [2025-11-06 18:35:03,445] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 123.65 | bwd_microstep: 2075.76 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 2074.66 | step_microstep: 2.24 [2025-11-06 18:35:03,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.01 | bwd: 2076.50 | bwd_inner: 1.68 | bwd_allreduce: 2074.70 | step: 2.31 59%|█████▊ | 2056/3507 [50:17<38:37, 1.60s/it] {'loss': 0.3761, 'learning_rate': 7.715651941770844e-06, 'epoch': 0.59} 59%|█████▊ | 2056/3507 [50:17<38:37, 1.60s/it]tensor([[-2.7188, 0.0557, 2.1875, -0.5312, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:03,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 108.21 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.5625, -3.2656, 0.8477, 2.4219, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9375, -4.4062, -0.4336, 2.8906, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7344, -3.2656, -1.4688, 2.9062, -0.4766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.1562, -4.1250, 0.6406, -1.1016, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5625, -2.3438, 1.3438, 0.6055, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2969, 1.0703, 3.8125, -1.7578, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9375, -4.2812, 0.5508, 1.8750, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:04,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.15 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 18:35:04,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.20 | 
bwd_microstep: 1.93 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.86 | step_microstep: 3.75 [2025-11-06 18:35:04,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.42 | bwd: 2.78 | bwd_inner: 1.71 | bwd_allreduce: 0.90 | step: 3.84 59%|█████▊ | 2057/3507 [50:18<36:12, 1.50s/it] {'loss': 0.4603, 'learning_rate': 7.706659896554594e-06, 'epoch': 0.59} 59%|█████▊ | 2057/3507 [50:18<36:12, 1.50s/it]tensor([[-1.0781, 2.6719, 2.9688, -2.0781, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.6250, -4.3125, 0.3125, 2.6406, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -4.1250, 0.2031, 3.5469, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:04,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.67 | bwd_microstep: 1.31 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 tensor([[-3.0625, -2.5469, -0.1641, 2.2500, -1.2891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[ 0.3457, 2.7500, 2.0000, -0.9961, -0.3984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.7500, -0.3242, 4.0625, -1.1406, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5625, -3.7500, -0.7305, 3.5625, -1.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6562, -4.3438, -1.1172, 2.3906, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:35:05,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:35:05,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.34 | bwd_microstep: 439.02 | bwd_inner_microstep: 
0.82 | bwd_allreduce_microstep: 438.12 | step_microstep: 2.01 [2025-11-06 18:35:05,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 465.04 | bwd: 440.33 | bwd_inner: 1.97 | bwd_allreduce: 438.19 | step: 2.12 59%|█████▊ | 2058/3507 [50:19<32:15, 1.34s/it] {'loss': 0.2127, 'learning_rate': 7.697669808176537e-06, 'epoch': 0.59} 59%|█████▊ | 2058/3507 [50:19<32:15, 1.34s/it]tensor([[-3.0312, 0.8398, 2.7344, -2.1719, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:05,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 99.25 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-7.0312, -5.0625, 0.4023, 1.6484, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5000, -3.6250, 1.5156, 0.5312, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.8438e+00, -3.7969e+00, 1.5547e+00, 6.5308e-03, -5.3438e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9688, -4.5000, 0.7344, 2.8594, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9844, -3.9688, -0.1855, 4.0625, -1.3984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8750, -4.0312, -1.4062, 2.5938, -1.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0312, -4.8125, -0.2256, 2.0312, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:06,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.14 | optimizer_step: 0.19 [2025-11-06 18:35:06,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.63 | bwd_microstep: 1.76 | bwd_inner_microstep: 0.93 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 1.56 [2025-11-06 18:35:06,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.90 | bwd: 2.71 | bwd_inner: 1.82 | bwd_allreduce: 0.76 | step: 1.63 59%|█████▊ | 2059/3507 [50:19<25:26, 1.05s/it] {'loss': 0.6934, 'learning_rate': 7.688681684307646e-06, 'epoch': 0.59} 59%|█████▊ | 2059/3507 [50:19<25:26, 1.05s/it]tensor([[-1.3359, 2.5312, 2.7188, -2.8594, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:35:06,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 64.67 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[0.4492, 3.0312, 4.8125, 2.2500, 0.0121]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5781, -2.2656, 0.6562, 1.6875, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1562, -2.6094, 1.4688, 0.5977, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0000, -1.3203, 2.8281, -0.5469, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5625, -4.7188, -1.4844, 2.7188, -1.9453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.3125, -4.6562, 1.6172, 1.5156, -5.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -4.9375, -2.9531, 1.5000, -1.7266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:35:08,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:35:08,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.08 | bwd_microstep: 2339.41 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 2338.31 | 
step_microstep: 2.02 [2025-11-06 18:35:08,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 234.76 | bwd: 2340.24 | bwd_inner: 1.76 | bwd_allreduce: 2338.35 | step: 2.09 59%|█████▊ | 2060/3507 [50:22<36:38, 1.52s/it] {'loss': 0.4561, 'learning_rate': 7.679695532617214e-06, 'epoch': 0.59} 59%|█████▊ | 2060/3507 [50:22<36:38, 1.52s/it]tensor([[-4.4375, -2.9219, 1.0469, 2.2812, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:08,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.36 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.0312, -2.0625, 1.2422, -0.9492, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3125, -4.1562, -2.2031, 2.6094, -0.7227]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.7500, -3.5781, 1.8203, 0.3105, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4688, -3.5781, 0.0581, 0.2314, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.8750, -3.9062, -0.6758, 3.2500, -1.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1875, -4.2812, -0.4434, 4.1875, -1.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9062, -0.9336, 2.7031, -1.7188, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 18:35:09,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:35:09,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.41 | bwd_microstep: 263.91 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 262.84 | step_microstep: 1.63 [2025-11-06 
18:35:09,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.80 | bwd: 264.81 | bwd_inner: 1.80 | bwd_allreduce: 262.88 | step: 1.70 59%|█████▉ | 2061/3507 [50:23<30:07, 1.25s/it] {'loss': 0.6475, 'learning_rate': 7.670711360772865e-06, 'epoch': 0.59} 59%|█████▉ | 2061/3507 [50:23<30:07, 1.25s/it]tensor([[-4.5938, -3.9375, 0.1562, 3.1406, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:09,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.00 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.2500, -4.7500, -1.0156, 2.3594, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1562, -3.9062, 0.4141, 2.3750, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, -3.5312, -0.0442, 2.2344, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.9375, -3.5469, 1.2422, 3.1562, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9531, 0.6445, 2.9844, -1.5703, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8438, -3.1875, 0.0859, 2.6250, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.8750, -5.6875, -0.3223, 0.3398, -5.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:35:11,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.26 [2025-11-06 18:35:11,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.21 | bwd_microstep: 1794.87 | bwd_inner_microstep: 1.36 | bwd_allreduce_microstep: 1793.42 | step_microstep: 1.95 [2025-11-06 18:35:11,438] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.23 | bwd: 1795.74 | bwd_inner: 2.15 | bwd_allreduce: 1793.47 | step: 2.03 59%|█████▉ | 2062/3507 [50:25<36:34, 1.52s/it] {'loss': 0.1375, 'learning_rate': 7.661729176440506e-06, 'epoch': 0.59} 59%|█████▉ | 2062/3507 [50:25<36:34, 1.52s/it]tensor([[-6.4688, -3.6562, 1.2969, 0.0598, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9531, -0.3906, 2.4531, -1.6172, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3438, -5.1875, -1.6328, 2.5781, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:11,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.85 | bwd_microstep: 1.13 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11 tensor([[-5.5625, -1.2656, 3.2188, -1.9453, -5.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4688, -2.2969, 1.7891, 3.7188, -1.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8125, -3.4844, 0.9922, 2.8438, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4062, -1.2500, 1.1641, -1.3594, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.7500, -0.8516, 3.2031, -0.9805, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:35:12,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:35:12,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.23 | bwd_microstep: 441.26 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 440.27 | step_microstep: 1.85 [2025-11-06 18:35:12,632] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 404.10 | bwd: 442.37 | bwd_inner: 1.90 | bwd_allreduce: 440.32 | step: 1.96 59%|█████▉ | 2063/3507 [50:26<34:11, 1.42s/it] {'loss': 0.8068, 'learning_rate': 7.652748987284375e-06, 'epoch': 0.59} 59%|█████▉ | 2063/3507 [50:26<34:11, 1.42s/it]tensor([[-3.4219, -3.3906, -1.7109, 1.5156, -1.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.3125, -5.1250, 0.4043, 1.1953, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:12,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.38 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-6.0000, -2.0781, 2.8750, -1.0391, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0938, -3.4219, 1.1406, 2.1875, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2812, -2.2500, 0.6094, 1.7188, -1.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4844, 1.1172, 2.6094, -2.1250, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.5000, -0.4023, 3.4844, -1.5000, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -2.1250, 1.4219, 0.8008, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:35:13,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 18:35:13,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.61 | bwd_microstep: 45.27 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 44.09 | step_microstep: 1.91 [2025-11-06 18:35:13,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.02 | bwd: 46.21 | 
bwd_inner: 1.93 | bwd_allreduce: 44.13 | step: 2.00 59%|█████▉ | 2064/3507 [50:26<26:54, 1.12s/it] {'loss': 0.593, 'learning_rate': 7.643770800966994e-06, 'epoch': 0.59} 59%|█████▉ | 2064/3507 [50:26<26:54, 1.12s/it]tensor([[-4.8438, -1.0781, 2.8594, -0.9844, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3438, -4.1250, -0.0977, 1.6016, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5781, 0.2324, 1.7812, -2.9062, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.1562, -3.4844, 0.0767, 2.6875, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.5234, -1.8828, -0.2910, 3.2656, 0.3086]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-7.0000, -5.8125, -0.3711, 2.5156, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4375, -2.3125, 1.9922, -0.2754, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:13,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.52 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.21 tensor([[-4.3750, -1.9531, 1.4844, 0.3359, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:14,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.21 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:35:14,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.66 | bwd_microstep: 1.80 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.86 | step_microstep: 3.81 [2025-11-06 18:35:14,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.21 | bwd: 2.70 | bwd_inner: 1.64 | bwd_allreduce: 0.90 | step: 
4.02 59%|█████▉ | 2065/3507 [50:27<26:37, 1.11s/it] {'loss': 0.9097, 'learning_rate': 7.634794625149184e-06, 'epoch': 0.59} 59%|█████▉ | 2065/3507 [50:27<26:37, 1.11s/it]tensor([[-5.3750, -2.7500, 1.6406, 0.9219, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:14,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.94 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.5312, -2.6250, 1.1328, -0.9102, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4688, -5.4062, -0.9180, 1.5938, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9531, -0.8320, 2.4062, -0.1650, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8125, -3.8594, -0.7227, 3.2656, -1.4297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6406, -3.2969, 0.4668, 4.0938, -1.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6250, 0.6133, 2.8125, -0.6406, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9062, -4.0938, 0.3984, 3.1719, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:35:16,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:35:16,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.22 | bwd_microstep: 2057.98 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 2056.99 | step_microstep: 2.04 [2025-11-06 18:35:16,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.18 | bwd: 2058.86 | bwd_inner: 1.70 | bwd_allreduce: 2057.03 | step: 2.12 59%|█████▉ | 2066/3507 
[50:30<36:31, 1.52s/it] {'loss': 0.127, 'learning_rate': 7.625820467490047e-06, 'epoch': 0.59} 59%|█████▉ | 2066/3507 [50:30<36:31, 1.52s/it]tensor([[-3.3750, -3.4844, -0.7852, 2.9375, -1.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.9297, 2.2500, 3.7969, 0.6328, -1.2891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4062, -3.2344, 2.1406, 0.3145, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7188, -1.7344, 1.5000, -0.8555, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3750, -2.5781, 1.5859, -0.1641, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0000, -3.1406, 0.0444, 0.0928, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -1.2422, 3.3125, -0.6758, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:17,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.79 | bwd_microstep: 5.62 | bwd_inner_microstep: 5.47 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12 tensor([[-6.1250, -3.6250, 0.4961, -0.3145, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:17,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.25 | optimizer_step: 0.33 [2025-11-06 18:35:17,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.39 | bwd_microstep: 2.25 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1.09 | step_microstep: 2.96 [2025-11-06 18:35:17,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.21 | bwd: 7.86 | bwd_inner: 6.53 | bwd_allreduce: 1.16 | step: 3.09 59%|█████▉ | 2067/3507 [50:31<34:12, 1.43s/it] {'loss': 0.4516, 
'learning_rate': 7.6168483356469555e-06, 'epoch': 0.59} 59%|█████▉ | 2067/3507 [50:31<34:12, 1.43s/it]tensor([[-4.7500, -3.8438, -0.3281, 1.7891, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:17,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.19 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11 tensor([[-3.9375, -0.5000, 3.4375, 0.2930, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0000, -3.9531, -0.7305, 3.2031, -1.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9531, -4.2500, -1.0312, 3.4375, -1.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.1875, -1.9141, 2.8750, -1.8516, -5.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7812, -0.8477, 2.6094, 0.3379, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.1250, -2.5000, -0.1689, 3.8750, -0.0292]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6250, -3.0156, 0.8516, -0.1865, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:35:20,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.78 | optimizer_gradients: 0.20 | optimizer_step: 0.23 [2025-11-06 18:35:20,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.75 | bwd_microstep: 1883.40 | bwd_inner_microstep: 1.43 | bwd_allreduce_microstep: 1881.89 | step_microstep: 2.87 [2025-11-06 18:35:20,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.92 | bwd: 1884.34 | bwd_inner: 2.23 | bwd_allreduce: 1881.95 | step: 2.98 59%|█████▉ | 2068/3507 [50:33<40:10, 1.68s/it] {'loss': 0.0942, 'learning_rate': 
7.607878237275561e-06, 'epoch': 0.59} 59%|█████▉ | 2068/3507 [50:33<40:10, 1.68s/it]tensor([[-3.7969, -3.3594, 0.3262, 3.4688, -1.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:20,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.38 | bwd_microstep: 1.11 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-2.4219, 1.8984, 3.6406, -2.5312, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.3438, -1.7891, 1.3984, 0.1318, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7188e+00, -4.5000e+00, -3.6774e-03, 2.1562e+00, -3.4688e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.4062, -4.6875, 0.8633, 2.4531, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1875, -4.4688, -0.2490, 2.6562, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8594, -4.4062, -2.1562, 2.3750, -1.2422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-7.7812, -5.5000, -0.5469, -0.1934, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:35:21,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.30 [2025-11-06 18:35:21,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.04 | bwd_microstep: 807.00 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 805.84 | step_microstep: 2.08 [2025-11-06 18:35:21,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.44 | bwd: 808.11 | bwd_inner: 2.08 | bwd_allreduce: 805.89 | step: 2.17 59%|█████▉ | 2069/3507 [50:35<36:31, 1.52s/it] {'loss': 0.85, 'learning_rate': 
7.598910180029783e-06, 'epoch': 0.59} 59%|█████▉ | 2069/3507 [50:35<36:31, 1.52s/it]tensor([[-5.0000, -2.3906, 1.0625, -0.1641, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0312, -0.5508, 3.4688, -2.2656, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:21,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.85 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.5938, 0.2891, 3.4844, 0.9023, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.3281, -2.8438, -0.8633, 3.1719, -0.1787]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0938, -3.5625, 0.4355, 1.6562, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.1562, -5.2812, 0.3066, 1.6797, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1875, -3.6875, 0.6055, -0.3398, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0156, 0.3691, 2.9062, 1.6641, -1.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:35:23,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.20 | optimizer_step: 0.23 [2025-11-06 18:35:23,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.10 | bwd_microstep: 1449.59 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1448.36 | step_microstep: 2.22 [2025-11-06 18:35:23,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 425.97 | bwd: 1450.53 | bwd_inner: 1.98 | bwd_allreduce: 1448.41 | step: 2.29 59%|█████▉ | 2070/3507 [50:36<39:18, 1.64s/it] {'loss': 0.2062, 'learning_rate': 7.5899441715617906e-06, 'epoch': 
0.59} 59%|█████▉ | 2070/3507 [50:36<39:18, 1.64s/it]tensor([[-4.1562, -2.8750, 0.2871, 1.9688, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:23,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.38 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.7812, -3.0469, 0.0635, 2.4844, -1.9141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9844, -0.9414, 2.1719, -0.4297, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3750, 0.9141, 3.1406, -2.1875, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2812, -1.7891, 1.2578, -2.1250, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6406, -3.2812, -1.9766, 1.9219, -0.5078]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.2500, -5.3750, 0.5391, 2.0000, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8594, -0.8750, 1.9219, -0.5430, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:24,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:35:24,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.23 | bwd_microstep: 1.60 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.77 | step_microstep: 1.97 [2025-11-06 18:35:24,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.63 | bwd: 2.35 | bwd_inner: 1.42 | bwd_allreduce: 0.80 | step: 2.05 59%|█████▉ | 2071/3507 [50:38<36:12, 1.51s/it] {'loss': 0.1338, 'learning_rate': 7.580980219522015e-06, 'epoch': 0.59} 59%|█████▉ | 2071/3507 [50:38<36:12, 
1.51s/it]tensor([[-5.5000, -3.6094, 0.9062, 1.6953, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1875, -0.1416, 1.8125, -3.6562, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1875, -4.9062, -0.2031, 1.6328, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:24,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.38 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-8.6250, -8.0000, -2.8281, 1.4609, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4062, -3.5156, 0.8320, 1.2656, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0625, -3.5938, 0.8906, 2.4062, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.2812, -3.2656, 1.7578, 2.2656, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3438, -2.7969, 0.7305, -0.1621, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:35:26,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:35:26,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.47 | bwd_microstep: 1539.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 1539.02 | step_microstep: 2.13 [2025-11-06 18:35:26,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 454.88 | bwd: 1540.62 | bwd_inner: 1.43 | bwd_allreduce: 1539.06 | step: 2.21 59%|█████▉ | 2072/3507 [50:40<39:56, 1.67s/it] {'loss': 0.6227, 'learning_rate': 7.572018331559126e-06, 'epoch': 0.59} 59%|█████▉ | 2072/3507 [50:40<39:56, 1.67s/it]tensor([[-5.3125, 
-3.5625, 0.9453, 1.7578, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:26,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.40 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.7188, -1.5234, 3.4219, -0.8086, -5.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8438, -2.8594, 1.0391, 1.0391, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8438, -0.2334, 4.1562, -1.6562, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2188, -4.1250, -0.6172, 1.2031, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0781, 1.7656, 2.9844, -2.0156, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.8906, -3.2344, -0.2461, 2.1719, -1.9453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.7578, 0.4473, 3.9844, 3.3906, -1.0547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:27,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:35:27,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.91 | bwd_microstep: 1.84 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.73 | step_microstep: 2.07 [2025-11-06 18:35:27,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.32 | bwd: 2.63 | bwd_inner: 1.75 | bwd_allreduce: 0.76 | step: 2.15 59%|█████▉ | 2073/3507 [50:41<37:46, 1.58s/it] {'loss': 0.7201, 'learning_rate': 7.5630585153200286e-06, 'epoch': 0.59} 59%|█████▉ | 2073/3507 [50:41<37:46, 1.58s/it]tensor([[-5.7812, -5.8750, -2.8750, 1.2266, -3.0000]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.8125, -3.5781, 1.9531, 0.3672, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4375, -3.0312, 0.1152, 1.2812, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:27,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.21 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.8281, 1.2266, 3.4844, -1.4766, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9844, -0.3125, 3.7500, 0.1143, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2031, -2.8438, -0.0513, 2.7188, -1.2891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2812, -3.1406, 0.5742, 2.4062, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7500, -3.0469, -0.0203, 2.0156, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:35:29,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:35:29,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.96 | bwd_microstep: 1279.84 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1278.77 | step_microstep: 1.66 [2025-11-06 18:35:29,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.20 | bwd: 1280.65 | bwd_inner: 1.70 | bwd_allreduce: 1278.81 | step: 1.74 59%|█████▉ | 2074/3507 [50:43<38:27, 1.61s/it] {'loss': 0.7522, 'learning_rate': 7.554100778449866e-06, 'epoch': 0.59} 59%|█████▉ | 2074/3507 [50:43<38:27, 1.61s/it]tensor([[-2.7188, -3.4531, -1.6875, 2.6875, -0.4277]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([[-5.6562, -4.4688, -0.8242, 0.9492, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([3], device='cuda:1') tensor([[-3.5938, -2.9062, 0.6133, 3.4688, -1.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:29,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.01 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.6562, 0.0269, 2.4688, -1.7734, -3.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5625, -3.3438, 2.0938, 0.3223, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2500, -3.2344, 0.9922, 1.3672, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.2812, 2.8125, 3.6094, -2.5938, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.7812, -2.1094, 2.8906, -0.0635, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:35:31,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:35:31,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 314.09 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.33 [2025-11-06 18:35:31,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 495.14 | bwd: 3.05 | bwd_inner: 2.03 | bwd_allreduce: 0.88 | step: 2.41 59%|█████▉ | 2075/3507 [50:45<39:56, 1.67s/it] {'loss': 0.2914, 'learning_rate': 7.545145128592009e-06, 'epoch': 0.59} 59%|█████▉ | 2075/3507 [50:45<39:56, 1.67s/it]tensor([[-6.1250, -5.5938, -0.8984, 2.8906, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:1') tensor([[-4.4062, -2.7812, 0.9062, 1.6328, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7812, -3.5312, 0.6523, 0.5625, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:31,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.47 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.6875, -1.7109, 3.5312, -0.2227, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4688, -4.3125, 0.7070, 1.2891, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7188, -3.2969, 0.4453, 1.9531, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-5.5625, -4.7812, -0.6797, 2.0938, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([2], device='cuda:3') tensor([[-5.0625, -0.4375, 3.6875, -2.2031, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:35:31,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.28 [2025-11-06 18:35:31,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.79 | bwd_microstep: 313.05 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 311.87 | step_microstep: 1.96 [2025-11-06 18:35:31,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.29 | bwd: 313.87 | bwd_inner: 1.81 | bwd_allreduce: 311.92 | step: 2.04 59%|█████▉ | 2076/3507 [50:45<32:58, 1.38s/it] {'loss': 0.684, 'learning_rate': 7.536191573388042e-06, 'epoch': 0.59} 59%|█████▉ | 2076/3507 [50:45<32:58, 1.38s/it]tensor([[-6.0938, -2.7188, 2.3906, 0.2305, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-5.8125, -5.5312, -1.4062, 2.6250, 
-2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([3], device='cuda:1') [2025-11-06 18:35:32,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.02 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.3906, 0.0574, 2.1406, -1.5156, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7969, 0.2363, 3.3438, -1.3359, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, -1.7734, 2.4219, 0.0170, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.1562, -2.2969, 1.8984, -1.6797, -5.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8594, -4.4375, -1.2656, 3.8906, -1.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.2500, -4.9062, -0.3125, 1.8203, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:35:33,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 18:35:33,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.71 | bwd_microstep: 2.10 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.35 [2025-11-06 18:35:33,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.77 | bwd: 2.98 | bwd_inner: 1.92 | bwd_allreduce: 0.92 | step: 2.42 59%|█████▉ | 2077/3507 [50:47<37:08, 1.56s/it] {'loss': 0.0764, 'learning_rate': 7.527240120477771e-06, 'epoch': 0.59} 59%|█████▉ | 2077/3507 [50:47<37:08, 1.56s/it]tensor([[-5.1250, -2.8594, 1.2734, 0.8242, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -2.2500, 1.4688, 0.2812, -3.6406]], 
[per-rank debug output elided: before each microstep, every rank (cuda:0–cuda:3) printed a 1x5 bfloat16 logits tensor (grad_fn repr truncated in the capture) followed by its target label tensor; per-microstep timing records (fwd_microstep/bwd_microstep/optimizer_*) are summarized by the aggregate [Rank 0] lines kept below]
[2025-11-06 18:35:34,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.10 | bwd: 13.85 | bwd_inner: 3.94 | bwd_allreduce: 9.72 | step: 2.03
 59%|█████▉ | 2078/3507 [50:48<29:02, 1.22s/it] {'loss': 0.5365, 'learning_rate': 7.51829077749919e-06, 'epoch': 0.59}
[2025-11-06 18:35:36,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 390.28 | bwd: 2.90 | bwd_inner: 1.89 | bwd_allreduce: 0.87 | step: 2.27
 59%|█████▉ | 2079/3507 [50:49<32:55, 1.38s/it] {'loss': 0.5092, 'learning_rate': 7.509343552088513e-06, 'epoch': 0.59}
[2025-11-06 18:35:37,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 438.97 | bwd: 1006.83 | bwd_inner: 1.72 | bwd_allreduce: 1004.98 | step: 1.92
 59%|█████▉ | 2080/3507 [50:51<33:37, 1.41s/it] {'loss': 0.2479, 'learning_rate': 7.500398451880133e-06, 'epoch': 0.59}
[2025-11-06 18:35:39,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.25 | bwd: 2.74 | bwd_inner: 1.72 | bwd_allreduce: 0.89 | step: 2.55
 59%|█████▉ | 2081/3507 [50:53<35:19, 1.49s/it] {'loss': 0.3206, 'learning_rate': 7.491455484506643e-06, 'epoch': 0.59}
[2025-11-06 18:35:40,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 523.91 | bwd: 711.40 | bwd_inner: 1.94 | bwd_allreduce: 709.32 | step: 2.11
 59%|█████▉ | 2082/3507 [50:54<33:52, 1.43s/it] {'loss': 0.1472, 'learning_rate': 7.4825146575988e-06, 'epoch': 0.59}
[2025-11-06 18:35:41,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 281.86 | bwd: 62.12 | bwd_inner: 1.42 | bwd_allreduce: 60.56 | step: 2.72
 59%|█████▉ | 2083/3507 [50:54<26:58, 1.14s/it] {'loss': 0.3303, 'learning_rate': 7.4735759787855525e-06, 'epoch': 0.59}
[2025-11-06 18:35:47,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 465.41 | bwd: 6291.28 | bwd_inner: 6.91 | bwd_allreduce: 6284.19 | step: 3.21
 59%|█████▉ | 2084/3507 [51:01<1:08:21, 2.88s/it] {'loss': 0.2266, 'learning_rate': 7.464639455693996e-06, 'epoch': 0.59}
[2025-11-06 18:35:48,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 270.40 | bwd: 131.87 | bwd_inner: 1.55 | bwd_allreduce: 130.18 | step: 2.19
 59%|█████▉ | 2085/3507 [51:02<50:54, 2.15s/it] {'loss': 0.5328, 'learning_rate': 7.455705095949403e-06, 'epoch': 0.59}
[2025-11-06 18:35:48,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 373.06 | bwd: 3.70 | bwd_inner: 1.71 | bwd_allreduce: 1.88 | step: 1.81
 59%|█████▉ | 2086/3507 [51:02<38:32, 1.63s/it] {'loss': 0.561, 'learning_rate': 7.446772907175191e-06, 'epoch': 0.59}
[2025-11-06 18:35:49,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 431.75 | bwd: 70.54 | bwd_inner: 1.94 | bwd_allreduce: 68.48 | step: 2.70
 60%|█████▉ | 2087/3507 [51:03<30:48, 1.30s/it] {'loss': 0.1689, 'learning_rate': 7.437842896992933e-06, 'epoch': 0.6}
[2025-11-06 18:35:50,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.34 | bwd: 947.58 | bwd_inner: 1.78 | bwd_allreduce: 945.66 | step: 2.78
 60%|█████▉ | 2088/3507 [51:04<31:22, 1.33s/it] {'loss': 0.5399, 'learning_rate': 7.4289150730223355e-06, 'epoch': 0.6}
[2025-11-06 18:35:52,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.97 | bwd: 3.43 | bwd_inner: 2.09 | bwd_allreduce: 1.16 | step: 2.96
 60%|█████▉ | 2089/3507 [51:06<34:08, 1.44s/it] {'loss': 0.455, 'learning_rate': 7.4199894428812435e-06, 'epoch': 0.6}
[2025-11-06 18:35:54,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.25 | bwd: 508.59 | bwd_inner: 7.34 | bwd_allreduce: 501.10 | step: 1.92
 60%|█████▉ | 2090/3507 [51:08<36:26, 1.54s/it] {'loss': 0.5485, 'learning_rate': 7.411066014185624e-06, 'epoch': 0.6}
[2025-11-06 18:35:57,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 419.91 | bwd: 2.79 | bwd_inner: 1.74 | bwd_allreduce: 0.88 | step: 2.54
 60%|█████▉ | 2091/3507 [51:11<46:59, 1.99s/it] {'loss': 0.5582, 'learning_rate': 7.402144794549577e-06, 'epoch': 0.6}
[2025-11-06 18:35:57,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.92 | bwd: 54.04 | bwd_inner: 1.43 | bwd_allreduce: 52.46 | step: 1.88
 60%|█████▉ | 2092/3507 [51:11<35:53, 1.52s/it] {'loss': 0.5346, 'learning_rate': 7.3932257915853056e-06, 'epoch': 0.6}
[2025-11-06 18:36:00,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.14 | bwd: 2.41 | bwd_inner: 1.48 | bwd_allreduce: 0.79 | step: 2.79
 60%|█████▉ | 2093/3507 [51:14<46:49, 1.99s/it] {'loss': 0.0715, 'learning_rate': 7.3843090129031335e-06, 'epoch': 0.6}
[2025-11-06 18:36:01,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.90 | bwd: 68.99 | bwd_inner: 1.53 | bwd_allreduce: 67.32 | step: 1.69
 60%|█████▉ | 2094/3507 [51:15<35:56, 1.53s/it] {'loss': 1.237, 'learning_rate': 7.375394466111479e-06, 'epoch': 0.6}
[2025-11-06 18:36:02,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.87 | bwd: 2.93 | bwd_inner: 1.86 | bwd_allreduce: 0.91 | step: 2.37
 60%|█████▉ | 2095/3507 [51:16<34:45, 1.48s/it] {'loss': 0.3052, 'learning_rate': 7.366482158816851e-06, 'epoch': 0.6}
[2025-11-06 18:36:03,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.79 | bwd: 248.56 | bwd_inner: 1.63 | bwd_allreduce: 246.76 | step: 1.56
 60%|█████▉ | 2096/3507 [51:17<31:44, 1.35s/it] {'loss': 0.4633, 'learning_rate': 7.357572098623855e-06, 'epoch': 0.6}
[2025-11-06 18:36:05,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.41 | bwd: 44.81 | bwd_inner: 1.46 | bwd_allreduce: 43.19 | step: 3.96
 60%|█████▉ | 2097/3507 [51:19<33:11, 1.41s/it] {'loss': 0.2681, 'learning_rate': 7.3486642931351835e-06, 'epoch': 0.6}
tensor([[-4.1562, -0.5234, 3.0312, -0.8125, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0625, -3.8281, 0.2100, 1.8984, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3125, -4.3125, -0.3750, 1.7109, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:06,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:36:06,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.40 | bwd_microstep: 1.88 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.79 | step_microstep: 1.83 [2025-11-06 18:36:06,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.57 | bwd: 2.78 | bwd_inner: 1.83 | bwd_allreduce: 0.82 | step: 1.92 60%|█████▉ | 2098/3507 [51:19<29:45, 1.27s/it] {'loss': 0.843, 'learning_rate': 7.339758749951592e-06, 'epoch': 0.6} 60%|█████▉ | 2098/3507 [51:19<29:45, 1.27s/it]tensor([[-5.5000, -2.5938, 1.8438, 0.2812, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:06,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.12 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.3750, -4.9375, -1.7734, 3.2812, -1.5078]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7812, -3.0312, -0.2676, 3.7500, -0.5508]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6562, -1.2344, 3.2812, -1.8438, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.8438, -5.2812, 0.8789, 1.1016, -5.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.5000, -5.0938, 0.9609, 1.3828, 
-5.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0781, -3.5625, -2.2188, 1.4531, -0.9492]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.9688, -6.1562, -0.4473, 1.2812, -5.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:07,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:36:07,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.41 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.04 [2025-11-06 18:36:07,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.56 | bwd: 2.90 | bwd_inner: 1.87 | bwd_allreduce: 0.89 | step: 2.12 60%|█████▉ | 2099/3507 [51:21<30:38, 1.31s/it] {'loss': 0.4302, 'learning_rate': 7.330855476671923e-06, 'epoch': 0.6} 60%|█████▉ | 2099/3507 [51:21<30:38, 1.31s/it]tensor([[-5.8125, -5.2812, -0.7695, 2.8125, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:07,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 137.25 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.9375, -0.8594, 1.6875, -1.0312, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2344, 1.6328, 1.8359, -3.4844, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.2344, -0.8047, 1.4844, 0.3184, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-8.0625, -6.0000, -0.0718, 1.3047, -5.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.8438, -2.9375, 2.2656, -1.4766, -6.0312]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -0.7852, 3.5312, -1.0781, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4375, -4.4375, -1.4844, 2.1406, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:08,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:36:08,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.45 | bwd_microstep: 1.88 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.01 [2025-11-06 18:36:08,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 265.70 | bwd: 2.61 | bwd_inner: 1.60 | bwd_allreduce: 0.88 | step: 2.09 60%|█████▉ | 2100/3507 [51:22<29:49, 1.27s/it] {'loss': 0.2097, 'learning_rate': 7.321954480893059e-06, 'epoch': 0.6} 60%|█████▉ | 2100/3507 [51:22<29:49, 1.27s/it]tensor([[-1.1016, -1.7500, -1.9609, 0.7812, 0.3535]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-5.8750, -5.1250, -0.7695, 2.2812, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -2.8125, 1.8125, 0.8711, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:08,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.88 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.7812, -5.3750, -0.5312, 1.6719, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.3047, -2.2500, -2.1094, 1.5391, 0.5117]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7812, -2.4844, 2.7031, 0.2402, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') tensor([[-3.3906e+00, -3.3569e-03, 2.2188e+00, -1.5078e+00, -3.4375e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9219, -0.6250, 2.7344, -0.4609, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:36:09,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:36:09,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.96 | bwd_microstep: 83.68 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 82.25 | step_microstep: 2.79 [2025-11-06 18:36:09,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.88 | bwd: 84.35 | bwd_inner: 1.91 | bwd_allreduce: 82.29 | step: 2.87 60%|█████▉ | 2101/3507 [51:23<24:15, 1.04s/it] {'loss': 0.401, 'learning_rate': 7.313055770209961e-06, 'epoch': 0.6} 60%|█████▉ | 2101/3507 [51:23<24:15, 1.04s/it]tensor([[-4.8438, -3.7500, 0.0669, 1.9922, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6875, -2.7500, 1.5234, 1.9141, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1250, -2.8594, 2.3281, 0.4219, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:10,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.16 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.3125, -1.3984, 3.1875, -0.9922, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1875, -3.9219, -0.3164, 3.1406, -1.8516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0625, -3.6406, -0.0479, 0.9805, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-3.7656, 0.6016, 3.2500, -2.1719, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1562, -3.7344, -0.2324, 0.8750, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:12,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:36:12,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.19 | bwd_microstep: 1.97 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.82 | step_microstep: 1.95 [2025-11-06 18:36:12,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.37 | bwd: 2.99 | bwd_inner: 2.00 | bwd_allreduce: 0.86 | step: 2.04 60%|█████▉ | 2102/3507 [51:25<36:45, 1.57s/it] {'loss': 0.4361, 'learning_rate': 7.304159352215625e-06, 'epoch': 0.6} 60%|█████▉ | 2102/3507 [51:25<36:45, 1.57s/it]tensor([[-3.7656, -0.1680, 3.5938, 0.3320, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0938, -4.9688, -0.9609, 1.1016, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.9375, -5.0312, -0.3535, 2.5781, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:12,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.51 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.5625, -0.1011, 3.5781, -2.3125, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5625, -0.3965, 2.7031, -0.5195, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1875, -1.7578, 3.0000, 0.1816, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4375, -2.9062, 0.5000, 1.3203, 
-2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -1.2422, 3.0781, -0.4980, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:36:12,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.04 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 18:36:12,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.68 | bwd_microstep: 315.30 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 314.33 | step_microstep: 3.29 [2025-11-06 18:36:12,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.20 | bwd: 315.98 | bwd_inner: 1.45 | bwd_allreduce: 314.37 | step: 3.37 60%|█████▉ | 2103/3507 [51:26<30:53, 1.32s/it] {'loss': 0.1067, 'learning_rate': 7.295265234501103e-06, 'epoch': 0.6} 60%|█████▉ | 2103/3507 [51:26<30:53, 1.32s/it]tensor([[-8.1250, -6.9375, -2.8125, -0.6133, -5.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8438, -2.9219, 1.7031, 2.4688, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:12,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.26 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.0938, 1.3281, 3.5781, -0.0430, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.1562, 0.7812, 2.2500, -0.8477, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0781, 0.5781, 3.7812, -0.3945, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3438, -0.1875, 2.2188, -1.0469, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.3125, -0.3535, 2.4375, -2.0938, -4.3750]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.7188, -2.1562, 2.1719, -0.8789, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:36:14,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:36:14,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 116.73 | bwd_microstep: 380.87 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 379.81 | step_microstep: 1.63 [2025-11-06 18:36:14,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 250.01 | bwd: 381.75 | bwd_inner: 1.73 | bwd_allreduce: 379.84 | step: 1.71 60%|█████▉ | 2104/3507 [51:28<35:27, 1.52s/it] {'loss': 0.8222, 'learning_rate': 7.2863734246554785e-06, 'epoch': 0.6} 60%|█████▉ | 2104/3507 [51:28<35:27, 1.52s/it]tensor([[-3.5312, -3.5469, -0.7188, 2.8281, -1.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:14,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.62 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-7.6562, -4.9688, 0.9453, 0.7852, -5.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -3.0000, 0.5625, 1.8047, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4375, -4.2188, -0.6797, 2.9219, -1.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0625, -4.0312, -0.7227, 3.3438, -1.5703]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4688, -3.6250, -0.8555, 2.9219, -1.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.2188, -1.0938, 2.2656, -2.4219, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:2') tensor([[-4.0000, 0.2012, 2.1562, -3.2656, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:36:19,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.17 | optimizer_step: 0.27 [2025-11-06 18:36:19,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 209.44 | bwd_microstep: 4392.75 | bwd_inner_microstep: 1.42 | bwd_allreduce_microstep: 4391.23 | step_microstep: 2.47 [2025-11-06 18:36:19,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.09 | bwd: 4393.54 | bwd_inner: 2.13 | bwd_allreduce: 4391.28 | step: 2.55 60%|██████ | 2105/3507 [51:33<58:16, 2.49s/it] {'loss': 0.6234, 'learning_rate': 7.277483930265865e-06, 'epoch': 0.6} 60%|██████ | 2105/3507 [51:33<58:16, 2.49s/it]tensor([[-4.9062, -1.6250, 1.7734, -1.0000, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9062, -3.7656, 0.5859, 2.9375, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5000, -4.2812, 0.2520, 2.5469, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:19,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.30 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06 tensor([[-3.1875, 0.2617, 2.6094, -0.9922, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0625, -4.8750, -1.2188, 2.4219, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7188, -2.8906, 0.9219, 3.3906, -1.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5625, -2.9688, 0.5469, 1.1875, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-5.5000, -3.3438, 0.8945, 1.0859, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:36:20,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:36:20,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.32 | bwd_microstep: 259.14 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 258.22 | step_microstep: 1.44 [2025-11-06 18:36:20,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 394.62 | bwd: 259.86 | bwd_inner: 1.50 | bwd_allreduce: 258.25 | step: 1.51 60%|██████ | 2106/3507 [51:34<45:35, 1.95s/it] {'loss': 0.1992, 'learning_rate': 7.268596758917395e-06, 'epoch': 0.6} 60%|██████ | 2106/3507 [51:34<45:35, 1.95s/it]tensor([[-4.6562, -4.8125, -1.5156, 2.7188, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3438, -4.0625, -1.5703, 3.3594, -0.7539]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4531, 0.4473, 2.8125, -1.7969, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:20,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.44 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5625, -0.5273, 2.5938, -0.1807, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.8828, 1.6953, 2.5469, -1.9375, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5312, -2.4688, 1.1016, 0.8633, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.5312, -5.0938, -0.4375, -0.7031, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5938, -3.0469, 2.4375, 
-0.3457, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:36:20,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:36:20,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.23 | bwd_microstep: 147.64 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 146.54 | step_microstep: 1.65 [2025-11-06 18:36:20,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 312.70 | bwd: 148.56 | bwd_inner: 1.85 | bwd_allreduce: 146.58 | step: 1.75 60%|██████ | 2107/3507 [51:34<35:21, 1.52s/it] {'loss': 0.2683, 'learning_rate': 7.259711918193231e-06, 'epoch': 0.6} 60%|██████ | 2107/3507 [51:34<35:21, 1.52s/it]tensor([[-5.0000, -1.3906, 2.8438, -0.2188, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3750, 0.5781, 2.9375, -2.0312, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6875, -0.7617, 2.4688, -1.9219, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:20,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.04 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.9375, -5.8125, -0.6758, 2.1562, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4688, 0.1118, 3.9844, -1.9297, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2344, 0.3809, 2.3594, -1.7031, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.9688, -4.5312, -0.5703, 3.0312, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.1250, -4.9688, 0.8125, 1.4766, -4.9375]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:36:21,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.26 | optimizer_step: 0.26 [2025-11-06 18:36:21,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.24 | bwd_microstep: 128.89 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 127.70 | step_microstep: 2.72 [2025-11-06 18:36:21,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.31 | bwd: 129.62 | bwd_inner: 1.70 | bwd_allreduce: 127.76 | step: 2.80 60%|██████ | 2108/3507 [51:35<28:28, 1.22s/it] {'loss': 0.5775, 'learning_rate': 7.250829415674536e-06, 'epoch': 0.6} 60%|██████ | 2108/3507 [51:35<28:28, 1.22s/it]tensor([[-2.0625, 1.2266, 5.0625, 1.7578, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:21,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.40 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.1875, -4.4062, -0.5469, 2.0312, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1562, -3.6406, -0.7109, 1.8906, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2969, 0.6055, 3.0312, -2.0469, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5312, -4.7500, -1.4688, 3.0312, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.3438, -2.7812, 2.5000, 0.1924, -5.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6562, -1.8984, 1.9609, -0.0791, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6406, -0.3027, 2.7500, -0.5586, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:3') [2025-11-06 18:36:21,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:36:21,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.16 | bwd_microstep: 186.64 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 185.31 | step_microstep: 1.98 [2025-11-06 18:36:21,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.58 | bwd: 187.32 | bwd_inner: 1.83 | bwd_allreduce: 185.34 | step: 2.06 60%|██████ | 2109/3507 [51:35<23:31, 1.01s/it] {'loss': 0.4086, 'learning_rate': 7.2419492589404885e-06, 'epoch': 0.6} 60%|██████ | 2109/3507 [51:35<23:31, 1.01s/it]tensor([[-2.8594, -3.4062, -2.1562, 1.5391, -0.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6250, -3.9688, -0.6172, 2.2188, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:21,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.93 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.2188, -0.9727, 3.3125, -1.7031, -5.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.2812, -3.6406, 2.4375, -0.1631, -5.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9062, 1.1172, 3.2656, -1.7422, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -1.4375, 1.7500, -0.1875, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6875, 0.6328, 4.4688, 1.2422, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1250, 0.4082, 3.9844, -1.8516, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') 
[2025-11-06 18:36:24,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:36:24,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.11 | bwd_microstep: 1.80 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 2.15 [2025-11-06 18:36:24,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.06 | bwd: 2.57 | bwd_inner: 1.67 | bwd_allreduce: 0.78 | step: 2.22 60%|██████ | 2110/3507 [51:38<34:04, 1.46s/it] {'loss': 0.3281, 'learning_rate': 7.233071455568259e-06, 'epoch': 0.6} 60%|██████ | 2110/3507 [51:38<34:04, 1.46s/it]tensor([[-2.7031, -0.1826, 2.0312, -0.2188, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3438, -4.9375, -0.6523, 3.4375, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5469, -4.3750, -2.9062, 1.4219, -1.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:24,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.83 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.9609, 2.1250, 2.9844, -2.6719, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2188, -4.4062, 0.9062, 2.2344, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9062, -5.1875, -0.5469, 2.9219, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1562, -4.5938, -0.4082, 3.0625, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4688, -0.9727, 2.7031, -0.5273, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:36:24,880] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:36:24,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.47 | bwd_microstep: 195.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 195.01 | step_microstep: 1.71 [2025-11-06 18:36:24,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.34 | bwd: 196.78 | bwd_inner: 1.59 | bwd_allreduce: 195.06 | step: 1.79 60%|██████ | 2111/3507 [51:38<28:00, 1.20s/it] {'loss': 0.5657, 'learning_rate': 7.2241960131330046e-06, 'epoch': 0.6} 60%|██████ | 2111/3507 [51:38<28:00, 1.20s/it]tensor([[-5.1875, -5.4688, -2.2812, 2.3281, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1875, -3.8906, 0.6445, 2.7500, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0312, 1.2109, 3.6094, -2.2188, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:25,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.10 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.1250, -5.3125, -0.1699, 1.2266, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.4375, -5.4375, -0.7148, 0.0413, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2188, -0.5039, 2.0312, -0.1045, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1250, -3.2656, 2.2969, 1.3516, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0625, -3.3438, 0.7031, -0.5742, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:27,826] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:36:27,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.30 | bwd_microstep: 1.67 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.73 | step_microstep: 2.00 [2025-11-06 18:36:27,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 434.34 | bwd: 2.62 | bwd_inner: 1.74 | bwd_allreduce: 0.76 | step: 2.08 60%|██████ | 2112/3507 [51:41<40:08, 1.73s/it] {'loss': 0.2041, 'learning_rate': 7.215322939207874e-06, 'epoch': 0.6} 60%|██████ | 2112/3507 [51:41<40:08, 1.73s/it]tensor([[-3.2656, -0.9727, 1.6953, 0.4395, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -3.1094, -0.4277, -1.0078, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.9180, 1.2188, 1.4297, -0.3594, -0.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:28,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.10 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.8672, 1.6719, 2.2656, -2.4531, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.0000, -4.5000, -1.2734, 1.5625, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -4.9688, -0.8594, 3.2812, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -3.2656, 0.5742, 1.1484, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1875, -2.8750, 1.0781, 0.6758, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:36:28,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | 
optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:36:28,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.35 | bwd_microstep: 425.00 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 424.09 | step_microstep: 2.06 [2025-11-06 18:36:28,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.47 | bwd: 425.71 | bwd_inner: 1.44 | bwd_allreduce: 424.13 | step: 2.14 60%|██████ | 2113/3507 [51:42<33:52, 1.46s/it] {'loss': 0.6964, 'learning_rate': 7.206452241363999e-06, 'epoch': 0.6} 60%|██████ | 2113/3507 [51:42<33:52, 1.46s/it]tensor([[-3.9688, -0.3086, 2.6250, -1.4375, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:28,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.06 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.1562, -4.6875, -1.0859, 2.2656, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1094, 2.0469, 3.7500, -2.2812, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.7500, -5.9375, -1.5938, 1.4531, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.9062, -4.6875, 0.3516, 0.8555, -4.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.6250, -4.2500, 0.6680, 2.8750, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0000, -1.5000, 0.5195, 0.8398, -1.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.4375, -3.0781, 2.3750, 0.0854, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:36:29,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.17 | 
optimizer_step: 0.26 [2025-11-06 18:36:29,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.49 | bwd_microstep: 773.43 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 772.56 | step_microstep: 2.01 [2025-11-06 18:36:29,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 319.57 | bwd: 774.32 | bwd_inner: 1.57 | bwd_allreduce: 772.60 | step: 2.10 60%|██████ | 2114/3507 [51:43<31:33, 1.36s/it] {'loss': 0.2084, 'learning_rate': 7.197583927170478e-06, 'epoch': 0.6} 60%|██████ | 2114/3507 [51:43<31:33, 1.36s/it]tensor([[-6.7500, -5.5625, -0.6523, 1.9219, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:29,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 70.39 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.4375, -5.4062, -1.7422, 2.3125, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.2812, -3.2344, -2.0156, 2.4219, -0.0291]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-3.7344, -0.4414, 1.1328, -2.7656, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.5938, -4.5625, 0.3418, 3.0938, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2500, -1.7188, 1.7344, 0.5039, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.1250, -5.7188, -0.2178, 2.2188, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4375, -5.0625, -2.3125, 2.6250, -1.6016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:36:31,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 
18:36:31,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.34 | bwd_microstep: 1307.37 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 1306.15 | step_microstep: 2.02 [2025-11-06 18:36:31,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 244.75 | bwd: 1308.35 | bwd_inner: 2.04 | bwd_allreduce: 1306.19 | step: 2.10 60%|██████ | 2115/3507 [51:45<33:04, 1.43s/it] {'loss': 0.614, 'learning_rate': 7.1887180041943746e-06, 'epoch': 0.6} 60%|██████ | 2115/3507 [51:45<33:04, 1.43s/it]tensor([[-3.4531, -1.1562, 1.8906, 0.6406, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7812, -1.2969, 2.1719, -0.9023, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.8438, -3.3438, 2.2344, -0.1670, -5.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[2.5781, 5.3750, 6.4375, 2.7812, 1.3828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7812, -1.8203, 3.5469, -0.2119, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7812, -4.3750, -0.6211, 2.8281, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7188, -4.6562, -0.2090, 1.8281, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:33,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.60 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-6.0312, -3.8438, 0.9062, 0.7227, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:33,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:36:33,541] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.45 | bwd_microstep: 2.04 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.35 [2025-11-06 18:36:33,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 458.08 | bwd: 3.06 | bwd_inner: 2.00 | bwd_allreduce: 0.93 | step: 2.44 60%|██████ | 2116/3507 [51:47<38:15, 1.65s/it] {'loss': 0.6934, 'learning_rate': 7.1798544800007205e-06, 'epoch': 0.6} 60%|██████ | 2116/3507 [51:47<38:15, 1.65s/it]tensor([[-3.7656, -0.7734, 2.2812, -0.0043, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.4062, -3.7031, 0.8984, -0.0184, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -4.4688, -0.3770, 3.2344, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:33,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.41 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.5000, -4.0625, 1.3984, -0.4844, -6.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8438, -3.0938, 1.8906, 3.0781, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3438, -3.8281, 0.1924, 1.4688, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6094, -2.2031, 0.9805, 1.8516, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.5781, 1.0859, 3.7500, -2.6250, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:36:34,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:36:34,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 184.58 | bwd_microstep: 404.57 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 403.51 | step_microstep: 1.66 [2025-11-06 18:36:34,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.02 | bwd: 405.25 | bwd_inner: 1.56 | bwd_allreduce: 403.55 | step: 1.75 60%|██████ | 2117/3507 [51:48<32:28, 1.40s/it] {'loss': 0.5556, 'learning_rate': 7.170993362152488e-06, 'epoch': 0.6} 60%|██████ | 2117/3507 [51:48<32:28, 1.40s/it]tensor([[-4.5000, -2.8906, 0.4961, 0.9414, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9375, -3.9688, -1.5078, 1.8359, -1.6797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -2.2812, 2.4062, 1.3672, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7500, -1.6016, 3.1250, -1.0078, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.6562, -4.9375, 0.4746, 2.1250, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6875, 0.9688, 2.2812, -2.1875, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-7.1250, -4.5312, 0.5430, 0.2275, -5.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:36,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.79 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 tensor([[-6.0312, -3.6250, 1.2812, 1.3750, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:37,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:36:37,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.96 | bwd_microstep: 
2.09 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.19 [2025-11-06 18:36:37,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.76 | bwd: 3.15 | bwd_inner: 2.03 | bwd_allreduce: 0.92 | step: 2.31 60%|██████ | 2118/3507 [51:50<41:32, 1.79s/it] {'loss': 1.0528, 'learning_rate': 7.162134658210602e-06, 'epoch': 0.6} 60%|██████ | 2118/3507 [51:50<41:32, 1.79s/it]tensor([[ 0.1904, 3.6562, 3.3438, -1.6094, -1.0859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.8125, -4.5938, -2.2188, 2.8125, -1.0859]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2500, -0.7422, 1.6641, -1.7578, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6406, -1.2969, 1.6953, 0.3379, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:37,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 265.88 | bwd_microstep: 1.09 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.4375, -3.5938, 0.6211, 1.1953, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.4531, 1.2422, 4.1875, 0.0075, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8125, -1.3281, 2.1875, -0.9453, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5625, -2.4219, 0.6680, 0.4102, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:37,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.13 | optimizer_step: 0.14 [2025-11-06 18:36:37,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.45 | bwd_microstep: 1.94 | bwd_inner_microstep: 1.16 | 
bwd_allreduce_microstep: 0.72 | step_microstep: 1.54 [2025-11-06 18:36:37,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 467.35 | bwd: 3.03 | bwd_inner: 2.17 | bwd_allreduce: 0.75 | step: 1.64 60%|██████ | 2119/3507 [51:51<32:35, 1.41s/it] {'loss': 0.3802, 'learning_rate': 7.153278375733935e-06, 'epoch': 0.6} 60%|██████ | 2119/3507 [51:51<32:35, 1.41s/it]tensor([[-7.0312, -4.5938, -0.1494, -0.6367, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6875, -4.3438, 0.1709, 1.7969, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.0312, 0.5508, 3.7500, 2.3906, -1.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5000, -5.6562, -0.8789, 2.2969, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.7188, -4.2812, 1.3906, 1.5938, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7969, 0.2676, 3.8438, -0.8047, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.7500, -5.5625, 0.6172, 1.5703, -5.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:39,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.84 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2188, -0.8750, 1.9531, -1.1484, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:39,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:36:39,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.57 | bwd_microstep: 2.02 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.86 | step_microstep: 
2.55 [2025-11-06 18:36:39,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.44 | bwd: 2.90 | bwd_inner: 1.89 | bwd_allreduce: 0.90 | step: 2.63 60%|██████ | 2120/3507 [51:53<39:09, 1.69s/it] {'loss': 0.571, 'learning_rate': 7.144424522279283e-06, 'epoch': 0.6} 60%|██████ | 2120/3507 [51:53<39:09, 1.69s/it]tensor([[-3.8281, -0.3906, 2.3594, -1.1562, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:36:40,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.38 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.2188, -1.2344, 3.2188, -0.8672, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4688, -3.0312, 2.3438, 0.0703, -5.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4688, -1.7734, 2.5625, -0.9297, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -1.5391, 2.4375, -0.9258, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.7031, -1.4844, -2.0156, 0.8359, 0.7266]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.9062, -4.1562, -0.3027, 2.4688, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7500, 1.2812, 3.3438, -2.0781, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:36:40,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.28 | optimizer_step: 0.19 [2025-11-06 18:36:40,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.71 | bwd_microstep: 457.51 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 456.49 | step_microstep: 2.14 [2025-11-06 18:36:40,760] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.12 | bwd: 458.53 | bwd_inner: 1.85 | bwd_allreduce: 456.53 | step: 2.22 60%|██████ | 2121/3507 [51:54<33:03, 1.43s/it] {'loss': 0.5245, 'learning_rate': 7.135573105401375e-06, 'epoch': 0.6} 60%|██████ | 2121/3507 [51:54<33:03, 1.43s/it]tensor([[-7.7812, -5.5000, 0.2871, 0.8477, -5.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4688, -3.2812, -0.2988, 3.0156, -1.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.4688, -2.7031, 2.4844, -0.6055, -5.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -3.8906, 0.2480, 2.5469, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9062, -4.6875, -0.2480, 2.1562, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -4.0312, 0.2168, 2.2812, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6250, -5.2812, -0.7383, 1.4844, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:41,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.07 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.5000, -5.2500, -0.6719, 1.5547, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:41,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:36:41,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.83 | bwd_microstep: 1.97 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.87 | step_microstep: 1.76 [2025-11-06 18:36:41,505] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 357.86 | bwd: 2.85 | bwd_inner: 1.79 | bwd_allreduce: 0.91 | step: 1.85 61%|██████ | 2122/3507 [51:55<28:16, 1.23s/it] {'loss': 0.4943, 'learning_rate': 7.126724132652854e-06, 'epoch': 0.61} 61%|██████ | 2122/3507 [51:55<28:16, 1.23s/it]tensor([[-7.5000, -5.5000, 0.3242, 1.5938, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:41,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.16 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-7.6875, -5.1562, 0.3750, 0.3809, -5.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6250, -2.7656, 0.7422, 0.5820, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9844, 1.4844, 3.4219, -3.0000, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.0312, -4.1875, 0.4785, -0.7656, -5.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-9.0000, -6.8750, -1.0156, 0.3281, -6.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5938, -1.8125, 2.4688, -1.2969, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -2.8750, 1.3672, 1.7969, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:36:42,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:36:42,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.68 | bwd_microstep: 945.44 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 944.33 | step_microstep: 1.89 [2025-11-06 18:36:42,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 293.85 | bwd: 946.36 | 
bwd_inner: 1.85 | bwd_allreduce: 944.37 | step: 1.96 61%|██████ | 2123/3507 [51:56<28:35, 1.24s/it] {'loss': 0.9022, 'learning_rate': 7.117877611584287e-06, 'epoch': 0.61} 61%|██████ | 2123/3507 [51:56<28:35, 1.24s/it]tensor([[-7.5938, -4.5000, 1.1641, -0.2275, -5.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5938, -4.1562, -0.4766, 2.7188, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7500, -2.0312, 3.0469, -0.3613, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1875, -4.7812, -2.3281, 2.3438, -1.5391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -1.1328, 2.5781, -2.2188, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8125, -4.4375, -1.8281, 2.8906, -1.2109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -4.4375, -0.9609, 1.7734, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:43,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.19 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0938, -0.2539, 3.1250, -0.9609, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:44,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:36:44,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.72 | bwd_microstep: 1.97 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.03 [2025-11-06 18:36:44,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.93 | bwd: 2.88 | bwd_inner: 1.91 | bwd_allreduce: 0.84 | 
step: 2.12 61%|██████ | 2124/3507 [51:57<28:47, 1.25s/it] {'loss': 0.0662, 'learning_rate': 7.109033549744141e-06, 'epoch': 0.61} 61%|██████ | 2124/3507 [51:57<28:47, 1.25s/it]tensor([[-5.0312, -2.7969, 2.1719, 2.0938, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7188, -4.9062, -1.0703, 3.7031, -1.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -3.6719, 0.1816, 2.0312, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:44,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.09 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.2188, -3.5625, 0.4062, -0.5664, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0312, -0.5859, 1.9219, -1.7031, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.8125, -3.8281, -1.2031, 2.1562, -1.6484]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.0442, 3.4844, 4.0312, -0.8281, -1.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-2.8594, -1.0859, 1.3750, 0.9336, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:36:46,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.20 | optimizer_step: 0.30 [2025-11-06 18:36:46,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.43 | bwd_microstep: 1719.18 | bwd_inner_microstep: 1.55 | bwd_allreduce_microstep: 1717.49 | step_microstep: 2.24 [2025-11-06 18:36:46,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.53 | bwd: 1720.26 | bwd_inner: 2.55 | bwd_allreduce: 1717.54 | step: 2.33 61%|██████ | 2125/3507 
[52:00<37:02, 1.61s/it] {'loss': 0.5533, 'learning_rate': 7.100191954678792e-06, 'epoch': 0.61} 61%|██████ | 2125/3507 [52:00<37:02, 1.61s/it]tensor([[-3.7656, 0.1445, 3.2656, -1.2891, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -4.4688, -0.3145, 3.4844, -2.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5156, -3.6875, -0.6758, 3.5156, -1.0859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6406, 0.7812, 3.5781, -0.1348, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4688, -4.5000, 0.2617, 1.0156, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.2344, -1.1094, 0.9492, 1.3516, -1.3516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9219, -2.6250, -1.3750, 2.5781, 0.0615]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:47,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.46 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.7812, -4.2812, 0.7578, 2.7500, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:47,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:36:47,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.79 | bwd_microstep: 1.94 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.88 | step_microstep: 2.00 [2025-11-06 18:36:47,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 396.28 | bwd: 2.77 | bwd_inner: 1.71 | bwd_allreduce: 0.92 | step: 2.10 61%|██████ | 2126/3507 [52:01<32:33, 1.41s/it] {'loss': 0.1797, 
'learning_rate': 7.091352833932508e-06, 'epoch': 0.61} 61%|██████ | 2126/3507 [52:01<32:33, 1.41s/it]tensor([[-6.1875, -4.0938, 1.3203, 2.1094, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8906, -3.4844, -0.7891, 1.9453, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -2.7656, 1.6406, 1.9453, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:47,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.94 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.7656, 0.1318, 3.7031, -0.6641, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6719, -3.8438, -0.8203, 3.5938, -1.1328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.5547, 2.7656, 3.8750, 0.0369, -1.2422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6562, 0.5938, 3.6875, -1.4531, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -2.5312, 0.9258, 1.1797, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:36:49,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.19 | optimizer_step: 0.23 [2025-11-06 18:36:49,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.50 | bwd_microstep: 1841.78 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1840.70 | step_microstep: 2.35 [2025-11-06 18:36:49,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.46 | bwd: 1842.67 | bwd_inner: 1.75 | bwd_allreduce: 1840.75 | step: 2.44 61%|██████ | 2127/3507 [52:03<38:27, 1.67s/it] {'loss': 0.2887, 'learning_rate': 
7.082516195047444e-06, 'epoch': 0.61} 61%|██████ | 2127/3507 [52:03<38:27, 1.67s/it]tensor([[-2.2500, 1.5312, 3.1875, -1.7188, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7188, -3.5469, 0.6367, 2.9531, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:36:49,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.38 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.4219e+00, 4.9609e-01, 4.1875e+00, -3.3112e-03, -3.5000e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -4.1562, -0.4688, 2.5469, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4062, -3.5781, -0.3105, 2.1719, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2500, -4.5312, -1.2031, 3.3438, -1.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-6.0000, -4.8750, -0.5273, 2.0469, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4688, 0.2002, 4.1562, 0.4316, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:36:50,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:36:50,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.77 | bwd_microstep: 100.98 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 100.00 | step_microstep: 1.83 [2025-11-06 18:36:50,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.16 | bwd: 103.01 | bwd_inner: 2.80 | bwd_allreduce: 100.04 | step: 1.92 61%|██████ | 2128/3507 [52:04<29:59, 1.30s/it] {'loss': 0.6919, 'learning_rate': 
7.073682045563632e-06, 'epoch': 0.61} 61%|██████ | 2128/3507 [52:04<29:59, 1.30s/it]tensor([[-3.7500, -4.1875, -1.7500, 2.6719, -1.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.2188, -5.1875, 0.7617, 1.9375, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:36:50,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.39 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.0000, -3.3281, 0.2197, 1.2422, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -3.1406, 0.8008, 1.4688, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.1250, -4.8750, 0.2432, 2.6562, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9062, -2.4062, 3.0781, 0.5117, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.4062, -3.5312, 1.5938, 0.4688, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.1562, -3.9844, 1.6016, 0.0713, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:36:55,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.21 | optimizer_step: 0.23 [2025-11-06 18:36:55,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.85 | bwd_microstep: 4633.95 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 4632.92 | step_microstep: 2.66 [2025-11-06 18:36:55,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.26 | bwd: 4634.79 | bwd_inner: 1.63 | bwd_allreduce: 4632.98 | step: 2.76 61%|██████ | 2129/3507 [52:09<55:36, 2.42s/it] {'loss': 0.444, 'learning_rate': 7.064850393018996e-06, 'epoch': 0.61} 
tensor([[-5.3750, -5.5312, -1.9531, 2.6094, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:36:55,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.48 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.2500, -4.1250, -0.0947, 2.2656, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6562, -2.5469, 1.5625, 1.4609, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.1250, -3.7812, -1.0469, 3.9062, -0.5703]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2812, 0.0952, 2.3906, -1.7031, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.2812, -2.9844, 2.4531, 0.4707, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4375, -5.3125, -1.9375, 1.9375, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7188, -0.6016, 3.6094, -0.6562, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:36:55,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.73 | optimizer_gradients: 0.21 | optimizer_step: 0.31
[2025-11-06 18:36:55,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.77 | bwd_microstep: 131.50 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 130.09 | step_microstep: 2.82
[2025-11-06 18:36:55,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.26 | bwd: 132.54 | bwd_inner: 2.25 | bwd_allreduce: 130.14 | step: 2.90
 61%|██████ | 2130/3507 [52:09<42:18, 1.84s/it] {'loss': 0.432, 'learning_rate': 7.056021244949315e-06, 'epoch': 0.61}
tensor([[-7.4062, -4.0938, 0.7227, -1.1328, -6.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.1250, -5.6562, -1.5859, 2.1094, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.6875, -5.2188, -1.4766, 1.9453, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.0938, -5.8125, -0.5508, 1.9297, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:36:55,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.19 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.1562, -2.7500, 0.6602, 1.7188, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.8750, -4.0312, 0.2031, 1.0391, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.5352, 2.1250, 1.9609, -0.6758, -0.9727]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-4.5938e+00, -1.9844e+00, 1.4453e+00, -1.9073e-03, -3.7188e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:36:56,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:36:56,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.30 | bwd_microstep: 538.85 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 537.81 | step_microstep: 2.12
[2025-11-06 18:36:56,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.48 | bwd: 539.77 | bwd_inner: 1.78 | bwd_allreduce: 537.85 | step: 2.20
 61%|██████ | 2131/3507 [52:10<36:13, 1.58s/it] {'loss': 0.8035, 'learning_rate': 7.047194608888233e-06, 'epoch': 0.61}
tensor([[-5.5312, -4.0625, 1.1172, 3.2500, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4062, -5.0312, -2.8438, 1.8984, -1.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:36:56,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.00 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.1250, -2.6875, 1.2578, 0.3516, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.3125, -4.7500, 0.4492, 2.4062, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0156, -3.7969, -2.1719, 2.3125, -0.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-7.0938, -4.4688, 1.0859, 0.8945, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.9219, -1.6484, 2.1406, 1.5469, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2500, -4.3750, -0.1260, 2.6719, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:36:57,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:36:57,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.13 | bwd_microstep: 89.65 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 88.64 | step_microstep: 1.62
[2025-11-06 18:36:57,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.15 | bwd: 90.46 | bwd_inner: 1.65 | bwd_allreduce: 88.67 | step: 1.69
 61%|██████ | 2132/3507 [52:10<28:17, 1.23s/it] {'loss': 0.6029, 'learning_rate': 7.038370492367261e-06, 'epoch': 0.61}
tensor([[-5.0000, -2.4375, 1.5156, 0.7031, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4062, -4.1875, -0.4473, 3.2500, -1.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.1562, -5.6875, -1.5625, 2.0625, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.9375, -3.1250, 1.7969, 0.6602, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:36:57,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.97 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.1875, -4.6875, -1.0156, 0.3223, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2812, -4.8438, -0.6523, 3.0000, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0000, -3.3125, 2.0625, 1.4453, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.3125, -0.6523, 1.6641, -2.4062, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:00,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.66 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 18:37:00,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.53 | bwd_microstep: 2317.11 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 2316.00 | step_microstep: 2.90
[2025-11-06 18:37:00,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 482.52 | bwd: 2317.99 | bwd_inner: 1.79 | bwd_allreduce: 2316.05 | step: 2.98
 61%|██████ | 2133/3507 [52:13<40:18, 1.76s/it] {'loss': 0.6482, 'learning_rate': 7.029548902915746e-06, 'epoch': 0.61}
tensor([[-6.1562, -3.8594, 1.6719, 2.1719, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:00,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.01 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.2812, -5.6875, -1.7109, 1.5469, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.8125, -5.7188, -0.4668, 2.5312, -3.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7500, -2.8594, -0.4023, 1.2891, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.9375, -4.5625, 1.0391, 1.3906, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.7188, -2.7031, 1.2422, 1.5000, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6250, -2.9219, 1.8984, 1.1172, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.2188, -3.8906, 1.7500, 2.0469, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:37:00,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:37:00,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.34 | bwd_microstep: 314.69 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 313.84 | step_microstep: 2.18
[2025-11-06 18:37:00,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 284.36 | bwd: 315.40 | bwd_inner: 1.37 | bwd_allreduce: 313.89 | step: 2.25
 61%|██████ | 2134/3507 [52:14<32:32, 1.42s/it] {'loss': 0.6117, 'learning_rate': 7.020729848060886e-06, 'epoch': 0.61}
tensor([[-5.2500, -4.5312, -0.5938, 2.0938, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3125, -5.1562, -1.3672, 2.5781, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0312, -1.0625, 2.1250, -0.2891, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.6250, -5.4688, -0.2676, 2.5781, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:00,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 206.07 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.15
tensor([[-4.6562, -4.4688, -0.8477, 2.9688, -2.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.0312, 0.9805, 3.4844, -1.2188, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.3438, -0.8281, 3.4219, -2.0156, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.2812, -4.5938, -0.3789, 0.7578, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:37:01,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.28 | optimizer_step: 0.21
[2025-11-06 18:37:01,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.39 | bwd_microstep: 653.72 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 652.35 | step_microstep: 2.45
[2025-11-06 18:37:01,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 390.51 | bwd: 654.68 | bwd_inner: 2.01 | bwd_allreduce: 652.44 | step: 2.62
 61%|██████ | 2135/3507 [52:15<30:15, 1.32s/it] {'loss': 0.5655, 'learning_rate': 7.011913335327718e-06, 'epoch': 0.61}
tensor([[-7.8125, -6.8438, -1.6562, 1.8516, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6250, -1.5312, 1.3516, -1.2188, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:01,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.73 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.0938, 0.2930, 2.6250, -0.8672, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4062, -3.9844, 0.1865, 3.7188, -1.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0000, -1.6953, 2.6250, 0.0884, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.9062, -2.2031, 3.0625, 0.1943, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.1875, -3.7500, 0.6797, 2.3281, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7188, 0.7461, 3.8906, -1.6250, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:03,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.21 | optimizer_step: 0.18
[2025-11-06 18:37:03,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.67 | bwd_microstep: 1318.38 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 1317.45 | step_microstep: 5.05
[2025-11-06 18:37:03,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 417.43 | bwd: 1319.21 | bwd_inner: 1.55 | bwd_allreduce: 1317.50 | step: 5.14
 61%|██████ | 2136/3507 [52:17<33:26, 1.46s/it] {'loss': 0.0875, 'learning_rate': 7.003099372239105e-06, 'epoch': 0.61}
tensor([[-3.8906, -0.7773, 2.9844, 0.8594, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.0000, -5.5938, -0.0674, 2.3281, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:03,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.76 | bwd_microstep: 5.37 | bwd_inner_microstep: 5.23 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.13
tensor([[-4.2812, -0.8789, 2.1875, -0.8711, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.8125, -4.0000, 0.2637, 0.5625, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.8320, 3.1719, 4.8750, -0.6094, -1.9297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-5.2188, -3.6562, 0.6562, 1.9688, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4375, -4.2188, -0.1982, 1.6172, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.2344, -0.8711, 0.5430, 0.3457, -1.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:06,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.29 | optimizer_step: 0.35
[2025-11-06 18:37:06,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 320.50 | bwd_microstep: 2427.14 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 2425.89 | step_microstep: 9.81
[2025-11-06 18:37:06,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 516.30 | bwd: 2432.51 | bwd_inner: 6.37 | bwd_allreduce: 2425.95 | step: 9.94
 61%|██████ | 2137/3507 [52:20<44:01, 1.93s/it] {'loss': 0.8337, 'learning_rate': 6.994287966315736e-06, 'epoch': 0.61}
tensor([[-5.0000, -4.0625, -0.0908, 2.3281, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.1562, -3.9531, -2.3906, 2.0469, -0.7422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-3.0312, -2.3594, 0.9336, 3.5625, -1.1484]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.9688, -2.9688, 2.5000, 0.9688, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:06,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 290.32 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.7188, -4.9375, 0.7109, 2.3438, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2188, -4.1562, -1.0703, 2.6562, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5938, -3.6719, -0.9453, 2.8281, -1.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9375, -5.1562, -2.5156, 1.4688, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:07,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 18:37:07,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.28 | bwd_microstep: 6.51 | bwd_inner_microstep: 5.53 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.14
[2025-11-06 18:37:07,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 448.62 | bwd: 7.48 | bwd_inner: 6.39 | bwd_allreduce: 0.93 | step: 7.22
 61%|██████ | 2138/3507 [52:20<34:17, 1.50s/it] {'loss': 0.8397, 'learning_rate': 6.985479125076125e-06, 'epoch': 0.61}
tensor([[-0.8750, 1.8125, 1.6484, -1.7734, -1.5547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-2.0469, -1.3047, 0.7266, 2.2969, -0.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8750, -1.6172, 3.1562, 0.8867, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:07,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.60 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.4375, -3.3750, 2.5625, 1.2344, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.8281, 0.1670, 2.8438, -1.8516, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7500, -3.9688, -0.6758, 3.6562, -1.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-8.0625, -6.6250, -0.7188, 1.9609, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.9062, -3.5625, 1.9141, -0.2793, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:10,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.79 | optimizer_gradients: 0.23 | optimizer_step: 0.25
[2025-11-06 18:37:10,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.07 | bwd_microstep: 3369.19 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 3368.05 | step_microstep: 3.33
[2025-11-06 18:37:10,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.72 | bwd: 3370.03 | bwd_inner: 1.75 | bwd_allreduce: 3368.10 | step: 3.42
 61%|██████ | 2139/3507 [52:24<49:51, 2.19s/it] {'loss': 0.1958, 'learning_rate': 6.976672856036586e-06, 'epoch': 0.61}
tensor([[-5.1875, -1.2969, 3.5781, -0.1396, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[h264 @ 0xc334b00] SEI type 0 size 64 truncated at 56
[h264 @ 0xc40aa00] SEI type 0 size 64 truncated at 56
tensor([[-5.7812, -5.3125, -1.0938, 2.3125, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:11,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.13 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[h264 @ 0xc40aa00] SEI type 0 size 64 truncated at 56
tensor([[-5.3125, -3.2656, 0.7227, 0.8516, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5938, -5.3750, -2.8125, 2.4062, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4531, -0.4199, 3.9375, 1.5859, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.7500, -4.6875, -1.1484, 2.7656, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.7188, -5.1562, 0.2266, 2.2969, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3125, -4.8438, -0.6523, 2.9844, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:37:11,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:37:11,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.84 | bwd_microstep: 65.03 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 63.99 | step_microstep: 1.43
[2025-11-06 18:37:11,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.00 | bwd: 65.98 | bwd_inner: 1.80 | bwd_allreduce: 64.04 | step: 1.52
 61%|██████ | 2140/3507 [52:25<37:43, 1.66s/it] {'loss': 0.1277, 'learning_rate': 6.967869166711243e-06, 'epoch': 0.61}
tensor([[-6.2188, -2.3281, 3.2969, -0.1816, -5.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:11,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.24 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.5625, -3.6875, -1.0469, 2.6250, -1.3047]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.6875, 0.4707, 2.4062, -0.9062, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6250, -2.8438, 1.5859, 2.4375, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4688, -1.6484, 1.1406, 0.8828, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.1875, -2.7656, 2.3281, -0.3027, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7969, 1.0391, 2.3750, -2.2500, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.6562, -1.7812, 3.2031, -0.6445, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:13,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:37:13,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.94 | bwd_microstep: 1764.46 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1763.30 | step_microstep: 1.85
[2025-11-06 18:37:13,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.21 | bwd: 1765.39 | bwd_inner: 1.86 | bwd_allreduce: 1763.35 | step: 1.94
 61%|██████ | 2141/3507 [52:27<40:43, 1.79s/it] {'loss': 0.3671, 'learning_rate': 6.959068064612022e-06, 'epoch': 0.61}
tensor([[-2.6406, -0.4922, 1.4219, -0.1514, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.0469, 0.7617, 2.1719, -2.3594, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:13,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.99 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06
tensor([[-3.4688, -3.5156, -0.3047, 3.5000, -1.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.4531, -0.1514, 1.6875, 0.1196, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6875, -4.6875, -2.0312, 1.6094, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0625, -4.1875, -0.3848, 1.8828, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4688, -0.8242, 2.5938, -1.0000, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.3438, -3.2344, 1.4141, 1.6641, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:37:13,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:37:13,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.89 | bwd_microstep: 8.58 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 7.37 | step_microstep: 1.48
[2025-11-06 18:37:13,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.90 | bwd: 9.50 | bwd_inner: 1.98 | bwd_allreduce: 7.40 | step: 1.55
 61%|██████ | 2142/3507 [52:27<31:10, 1.37s/it] {'loss': 0.2184, 'learning_rate': 6.950269557248639e-06, 'epoch': 0.61}
tensor([[-4.7812, -1.4531, 2.3125, -0.8359, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:13,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 120.48 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.7188, 0.0554, 1.8594, -2.7344, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-4.0938, -3.7969, -0.8008, 2.3438, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6875, -3.1562, 0.8203, 1.9688, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.2500, -3.5000, 0.2754, 0.9336, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8438, -4.2188, -0.2344, 2.5156, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.1875, -3.1562, 1.2031, 1.4062, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.1875, 0.1299, 1.0547, -2.5000, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:37:17,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.99 | optimizer_gradients: 0.21 | optimizer_step: 0.32
[2025-11-06 18:37:17,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.05 | bwd_microstep: 3686.75 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 3685.49 | step_microstep: 3.21
[2025-11-06 18:37:17,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.45 | bwd: 3687.71 | bwd_inner: 2.03 | bwd_allreduce: 3685.54 | step: 3.30
 61%|██████ | 2143/3507 [52:31<49:09, 2.16s/it] {'loss': 0.7676, 'learning_rate': 6.941473652128598e-06, 'epoch': 0.61}
tensor([[-5.9375, -5.2500, -1.1016, 1.8750, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:18,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.07 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06
tensor([[-6.5938, -6.2812, -1.9766, 2.0781, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2188, -4.3438, 0.1328, 3.0156, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.1406, -1.0547, 1.1875, -0.2451, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.6875, 1.7422, 4.0625, -1.7812, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.7188, -0.9414, 2.2812, 0.5781, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.4688, -5.5000, -0.6367, 0.3496, -5.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.1250, -5.1562, -0.3320, 2.5312, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:37:18,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:37:18,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 89.58 | bwd_microstep: 244.78 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 244.01 | step_microstep: 1.43
[2025-11-06 18:37:18,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 267.66 | bwd: 245.64 | bwd_inner: 1.47 | bwd_allreduce: 244.04 | step: 1.49
 61%|██████ | 2144/3507 [52:32<38:06, 1.68s/it] {'loss': 0.4332, 'learning_rate': 6.932680356757173e-06, 'epoch': 0.61}
tensor([[-6.4375, -2.9844, 2.7969, 0.5508, -5.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.5469, -0.4180, 2.9219, 0.2891, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3125, -0.1270, 3.3125, -1.8125, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:18,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.01 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-8.8750, -6.5938, -0.0133, 1.1797, -6.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.4375, -5.3125, -0.8242, 1.1875, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5312, -1.1797, 3.0781, -1.8984, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.5312, -0.5312, 2.8750, -1.4688, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.1562, -2.5781, 2.6094, -0.1138, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:20,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:37:20,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.61 | bwd_microstep: 1721.63 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 1720.65 | step_microstep: 1.88
[2025-11-06 18:37:20,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.64 | bwd: 1722.41 | bwd_inner: 1.60 | bwd_allreduce: 1720.68 | step: 1.95
 61%|██████ | 2145/3507 [52:34<41:14, 1.82s/it] {'loss': 0.2471, 'learning_rate': 6.923889678637425e-06, 'epoch': 0.61}
tensor([[-6.1875, -4.4688, 0.8555, 2.4531, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.8672, 1.8359, 3.0469, -2.1250, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:20,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.39 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-9.4375, -9.6875, -5.9688, -0.5625, -5.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5312, -3.4688, 0.6602, 2.9219, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.3281, 0.2490, 2.3125, -1.8594, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.8125, -4.5625, -2.3594, 2.6562, -1.0391]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6875, -3.7344, -0.0491, 2.3125, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5000, -1.6797, 2.0938, -1.5781, -5.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:37:20,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:37:20,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.98 | bwd_microstep: 132.78 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 131.71 | step_microstep: 2.04
[2025-11-06 18:37:20,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 277.39 | bwd: 133.56 | bwd_inner: 1.69 | bwd_allreduce: 131.75 | step: 2.12
 61%|██████ | 2146/3507 [52:34<31:50, 1.40s/it] {'loss': 0.1123, 'learning_rate': 6.915101625270175e-06, 'epoch': 0.61}
tensor([[-4.1562, -4.3125, -1.2969, 2.7969, -1.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3438, -4.0625, -0.7969, 2.4844, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6250, -1.6094, 1.9297, -0.2285, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-10.1875, -7.9688, -3.4219, -2.8438, -7.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4688, -4.1250, 0.7109, 2.5938, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:21,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.85 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[1.2109, 1.4062, 3.9375, 6.7812, 2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.7500, -5.9688, -1.3750, 1.8984, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.0156, 1.7031, 4.9062, 0.5508, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:21,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:37:21,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.61 | bwd_microstep: 187.60 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 186.47 | step_microstep: 1.70
[2025-11-06 18:37:21,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 511.49 | bwd: 188.44 | bwd_inner: 1.80 | bwd_allreduce: 186.50 | step: 1.78
 61%|██████ | 2147/3507 [52:35<27:19, 1.21s/it] {'loss': 0.1957, 'learning_rate': 6.906316204154002e-06, 'epoch': 0.61}
tensor([[0.0454, 0.7656, 3.7188, 5.8750, 1.2109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7500, -0.3164, 2.7188, -0.2988, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.6250, -4.5312, 0.5117, 1.0859, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:21,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.79 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.7188, -2.1719, -0.5977, 3.1094, 0.2148]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.9062, -4.6562, -0.3496, 1.5312, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.3125, -4.0312, 1.5156, 1.7266, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5312, -4.4688, -0.4434, 1.2891, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.3438, -4.4062, -0.9609, 3.2188, -1.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:37:22,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:37:22,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.63 | bwd_microstep: 190.63 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 189.59 | step_microstep: 1.73
[2025-11-06 18:37:22,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.44 | bwd: 191.68 | bwd_inner: 1.92 | bwd_allreduce: 189.63 | step: 1.81
 61%|██████ | 2148/3507 [52:36<22:50, 1.01s/it] {'loss': 0.2846, 'learning_rate': 6.897533422785245e-06, 'epoch': 0.61}
tensor([[-3.7344, -0.8711, 2.6875, 0.6367, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.5625, 0.5195, 3.2188, -1.7578, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:22,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.72 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.5625, -3.1875, 1.2031, 2.6719, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2188, -0.2119, 2.5312, -2.3438, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.1250, 0.9688, 3.4062, -1.7422, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5000, -3.5312, 0.8984, 1.4375, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.2656, 0.9336, 3.1094, -2.3281, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.2500, -5.0000, -3.0469, 1.7188, -1.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:37:23,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:37:23,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.10 | bwd_microstep: 472.41 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 471.36 | step_microstep: 2.41
[2025-11-06 18:37:23,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.84 | bwd: 473.25 | bwd_inner: 1.71 | bwd_allreduce: 471.41 | step: 2.48
 61%|██████▏ | 2149/3507 [52:37<26:25, 1.17s/it] {'loss': 0.2215, 'learning_rate': 6.8887532886579896e-06, 'epoch': 0.61}
tensor([[-5.7500e+00, -2.7656e+00, 1.7969e+00, -1.4114e-03, -4.6250e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.6562, 0.8242, 3.2812, -2.4531, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2031, -1.5938, 1.8203, 2.4688, -1.8828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:25,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.03 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-4.1562, -3.6094, 0.1138, 3.1250, -1.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0625, -3.7031, 0.4531, -0.2129, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.6875, -2.2656, 3.0156, 0.3535, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7656, -1.1172, 1.8438, 0.3145, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6562, -1.7031, 2.3750, 0.3848, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:26,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.18 | optimizer_step: 0.20
[2025-11-06 18:37:26,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.69 | bwd_microstep: 688.45 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 687.15 | step_microstep: 2.47
[2025-11-06 18:37:26,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 416.74 | bwd: 689.44 | bwd_inner: 2.05 | bwd_allreduce: 687.20 | step: 2.57
61%|██████▏ | 2150/3507 [52:39<34:40, 1.53s/it] {'loss': 0.2054, 'learning_rate': 6.879975809264052e-06, 'epoch': 0.61}
tensor([[-5.2812, -3.9688, 0.3281, 2.2656, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2500, -2.5938, 1.6094, 2.6250, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:26,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.55 | bwd_microstep: 1.45 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-2.5000, -3.4375, -1.9922, 2.6562, -0.1836]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-4.6250, -0.7617, 3.6719, -0.3418, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3125, -1.8203, 1.7266, 0.5078, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6719, -0.8945, 2.0312, -0.1680, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-7.4375, -5.8438, -0.8750, 0.5625, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.3125, 0.0688, 2.3125, -0.9492, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:37:27,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 18:37:27,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.59 | bwd_microstep: 740.53 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 739.60 | step_microstep: 3.10
[2025-11-06 18:37:27,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.17 | bwd: 741.99 | bwd_inner: 2.15 | bwd_allreduce: 739.66 | step: 3.22
61%|██████▏ | 2151/3507 [52:41<32:42, 1.45s/it] {'loss': 1.0862, 'learning_rate': 6.871200992092999e-06, 'epoch': 0.61}
tensor([[-2.6406, 1.3594, 2.8750, -2.3750, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.1562, -4.1562, 1.7031, 0.4531, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2500, -3.5625, -0.7695, 3.5000, -0.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.1250, -1.2656, 3.2188, -0.8320, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:27,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 299.36 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.7188, -4.0625, -0.9805, 1.5156, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8438, -0.6797, 2.7188, -2.3438, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-7.4375, -4.6875, 0.8008, 0.1289, -5.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.7500, -3.1250, 1.7578, 0.5156, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:29,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.17 | optimizer_step: 0.22
[2025-11-06 18:37:29,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.64 | bwd_microstep: 874.43 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 873.29 | step_microstep: 2.13
[2025-11-06 18:37:29,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 541.02 | bwd: 875.32 | bwd_inner: 1.82 | bwd_allreduce: 873.35 | step: 2.22
61%|██████▏ | 2152/3507 [52:42<33:51, 1.50s/it] {'loss': 0.5627, 'learning_rate': 6.862428844632114e-06, 'epoch': 0.61}
tensor([[-2.2812, 0.2148, 1.9766, -0.1348, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:29,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 105.26 | bwd_microstep: 1.43 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.3125, -4.1250, 0.5664, 2.8438, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.1562, -3.8594, -2.8750, 0.7734, -0.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0000, 1.7188, 4.2188, -2.4219, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.0625, -4.0625, 0.4844, 2.8906, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.0000, -4.6875, -0.7070, 3.0000, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0625, -2.5312, -0.0120, -1.5156, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8125, -0.3105, 3.3906, -2.1406, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:37:29,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.68 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:37:29,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.18 | bwd_microstep: 154.58 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 153.43 | step_microstep: 2.11
[2025-11-06 18:37:29,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 283.46 | bwd: 156.00 | bwd_inner: 2.40 | bwd_allreduce: 153.48 | step: 2.19
61%|██████▏ | 2153/3507 [52:43<26:52, 1.19s/it] {'loss': 0.7513, 'learning_rate': 6.853659374366408e-06, 'epoch': 0.61}
tensor([[-3.3594, -2.2344, 0.8594, 2.2031, -1.7578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:29,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.85 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.6094, -4.1875, -2.1719, 2.0156, -1.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
tensor([[-3.7344, -1.7266, 1.8828, 1.8516, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2812, -1.5156, 2.1094, 0.8633, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.0000, -2.2812, -0.8359, 2.2969, -0.1680]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.5312, -6.1875, -2.0625, 1.7578, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.7500, -5.1562, -1.2188, 1.8672, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0625, -4.4062, -1.1875, 3.5781, -1.2578]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:37:32,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:37:32,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.27 | bwd_microstep: 2108.04 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2106.96 | step_microstep: 2.09
[2025-11-06 18:37:32,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 279.14 | bwd: 2109.10 | bwd_inner: 1.96 | bwd_allreduce: 2107.01 | step: 2.18
61%|██████▏ | 2154/3507 [52:45<35:56, 1.59s/it] {'loss': 0.5905, 'learning_rate': 6.8448925887786114e-06, 'epoch': 0.61}
tensor([[-3.7188, -3.8281, -0.2451, 4.0312, -1.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1250, -2.2500, 1.6484, 2.0625, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.3438, -2.9844, 0.1924, 3.2031, -1.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.9688, -4.0938, 1.4141, 0.3926, -5.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:32,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.63 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.4062, -2.2344, 0.7461, 0.0408, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.2344, -2.4531, 0.1084, 3.8281, -0.1982]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.8750, -4.5312, 0.8945, 3.1094, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0312, -4.3125, -1.6875, 2.4375, -1.5391]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:32,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:37:32,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.06 | bwd_microstep: 1.51 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.62 | step_microstep: 2.15
[2025-11-06 18:37:32,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 486.70 | bwd: 2.33 | bwd_inner: 1.56 | bwd_allreduce: 0.65 | step: 2.23
61%|██████▏ | 2155/3507 [52:46<28:43, 1.28s/it] {'loss': 0.1944, 'learning_rate': 6.836128495349152e-06, 'epoch': 0.61}
tensor([[-4.2812, -3.1406, 0.6211, 2.4844, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:32,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.66 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.6562, -3.1875, 0.9922, 2.3906, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0312, -1.1328, 3.0625, -0.9492, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.2500, 1.2266, 4.1562, -1.7656, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.9375, -6.0938, -2.3438, 2.0469, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3750, -3.9062, -0.0830, 3.2031, -1.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5312, -3.3594, 1.3047, 1.3906, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0938, -1.2969, 3.3750, -0.1084, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:35,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.24 | optimizer_step: 0.34
[2025-11-06 18:37:35,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 109.81 | bwd_microstep: 2855.55 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 2854.34 | step_microstep: 2.60
[2025-11-06 18:37:35,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 262.49 | bwd: 2856.47 | bwd_inner: 1.93 | bwd_allreduce: 2854.39 | step: 2.68
61%|██████▏ | 2156/3507 [52:49<41:23, 1.84s/it] {'loss': 0.3289, 'learning_rate': 6.827367101556168e-06, 'epoch': 0.61}
tensor([[-4.9062, -1.5469, 2.6406, 0.2852, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:35,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.21 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.9062, -2.3125, 1.5781, 0.7461, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0938, -3.5312, 0.4180, 3.2188, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4062, -2.3125, 1.4062, 0.9375, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.1562, -4.8125, -0.2412, 1.6172, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5938, -2.5000, 1.6016, 1.4219, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0312, -4.4688, 0.9727, 2.8281, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.7188, -2.2344, 2.7344, -0.2275, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:37:36,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.19
[2025-11-06 18:37:36,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.20 | bwd_microstep: 218.60 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 217.52 | step_microstep: 1.95
[2025-11-06 18:37:36,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.47 | bwd: 219.67 | bwd_inner: 1.96 | bwd_allreduce: 217.56 | step: 2.03
62%|██████▏ | 2157/3507 [52:50<32:45, 1.46s/it] {'loss': 0.6763, 'learning_rate': 6.818608414875498e-06, 'epoch': 0.62}
tensor([[-3.5000, -3.9219, -2.2656, 1.5547, -1.1328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.8125, -3.4375, 1.3828, -0.9023, -5.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:36,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.99 | bwd_microstep: 6.07 | bwd_inner_microstep: 5.92 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-1.6797, 0.8867, 1.9531, -0.4277, -1.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.4688, -2.5312, 2.5156, -1.1797, -5.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1562, -5.1562, -1.5312, 2.3906, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.3125, -2.5312, 2.7344, -0.5859, -5.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.0000, -6.0000, -1.8984, 0.5234, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0938, -3.8125, 1.4375, 1.8984, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:39,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.23 | optimizer_step: 0.24
[2025-11-06 18:37:39,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.18 | bwd_microstep: 3053.06 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 3052.02 | step_microstep: 2.48
[2025-11-06 18:37:39,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.18 | bwd: 3059.12 | bwd_inner: 6.86 | bwd_allreduce: 3052.08 | step: 2.58
62%|██████▏ | 2158/3507 [52:53<45:53, 2.04s/it] {'loss': 0.218, 'learning_rate': 6.809852442780664e-06, 'epoch': 0.62}
tensor([[ 0.4668, 3.9219, 4.6562, -0.3125, -0.7773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:39,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.96 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.4688, -3.6250, -1.4922, 1.9062, -1.3359]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6719, -4.5312, -2.1250, 2.8750, -0.9180]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3125, -1.0781, 3.4531, -1.4062, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6875, -1.6250, 1.5625, 1.0469, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.1406, 1.2734, 3.7344, -1.8281, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.7188, -4.9062, 0.7422, 2.1094, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5000, -3.9688, 0.4473, 1.3594, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:37:40,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.19
[2025-11-06 18:37:40,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.82 | bwd_microstep: 95.10 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 93.88 | step_microstep: 1.88
[2025-11-06 18:37:40,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 284.79 | bwd: 96.03 | bwd_inner: 1.97 | bwd_allreduce: 93.92 | step: 1.97
62%|██████▏ | 2159/3507 [52:53<34:53, 1.55s/it] {'loss': 0.3919, 'learning_rate': 6.80109919274287e-06, 'epoch': 0.62}
tensor([[-3.5781, -4.2500, -2.0000, 2.5625, -0.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5000, -1.4062, 2.8281, 0.7422, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0312, -4.0625, 1.5312, 2.4688, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.4062, -5.4062, -1.1172, 1.3828, -3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:40,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.69 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.7969, -2.1406, 0.5977, 0.3887, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2812, -4.1250, -0.8477, 2.5312, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5312, -2.9375, 1.8828, 1.1016, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.2812, -3.4375, 2.0312, 0.9219, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:42,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:37:42,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.11 | bwd_microstep: 1517.38 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 1516.39 | step_microstep: 2.44
[2025-11-06 18:37:42,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.82 | bwd: 1518.15 | bwd_inner: 1.60 | bwd_allreduce: 1516.43 | step: 2.51
62%|██████▏ | 2160/3507 [52:55<37:14, 1.66s/it] {'loss': 0.4869, 'learning_rate': 6.792348672231011e-06, 'epoch': 0.62}
tensor([[-5.4375, -3.6719, 0.4316, 1.0859, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:42,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 99.61 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.4688, 0.1689, 3.6250, -0.2793, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.3750, -2.8594, 0.5234, 1.5703, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.5156, -1.3125, 1.2578, 2.1406, -1.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.9062, -4.8750, 0.6602, 1.5859, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.8750, -1.7734, 1.8516, -0.4492, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4062, -1.9531, 2.7344, -0.0845, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7188, -4.3125, -0.4688, 2.9531, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:37:42,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.66 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:37:42,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.66 | bwd_microstep: 193.79 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 192.80 | step_microstep: 2.36
[2025-11-06 18:37:42,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 223.28 | bwd: 194.76 | bwd_inner: 1.79 | bwd_allreduce: 192.84 | step: 2.44
62%|██████▏ | 2161/3507 [52:56<29:02, 1.29s/it] {'loss': 0.7862, 'learning_rate': 6.783600888711633e-06, 'epoch': 0.62}
tensor([[-3.8750, -1.7578, 1.2109, 0.6289, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2188, -4.2188, -1.9141, 1.2891, -1.9609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:42,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.16 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.3750, -4.3125, 0.9648, 1.6562, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1562, -3.6250, -0.5078, 2.2031, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.7344, -3.5156, -2.0625, 2.1562, -0.3926]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6719, 0.3301, 3.2656, -1.4922, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-7.2188, -5.2812, 0.5547, 1.6797, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1562, -2.7500, 1.4062, 0.9688, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:37:44,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.24 | optimizer_step: 0.36
[2025-11-06 18:37:44,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 334.89 | bwd_microstep: 1361.74 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1360.65 | step_microstep: 2.66
[2025-11-06 18:37:44,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 505.08 | bwd: 1362.72 | bwd_inner: 1.87 | bwd_allreduce: 1360.70 | step: 2.74
62%|██████▏ | 2162/3507 [52:58<33:12, 1.48s/it] {'loss': 0.3251, 'learning_rate': 6.774855849648961e-06, 'epoch': 0.62}
tensor([[-4.8125, -2.0000, 2.0312, 0.5469, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.6250, -3.0000, -0.8750, 3.0312, -0.4082]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.3438, -2.5000, 2.5469, -0.9570, -5.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:37:44,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.27 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-2.1719, -2.8594, -1.6719, 2.0312, -0.1240]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2344, -0.1426, 2.7188, -0.1699, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5000, -4.4375, 1.1719, 1.9453, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.1719, 0.0579, 2.0000, 0.6211, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.0625, -6.8750, -3.5312, 0.1152, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:44,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.14 | optimizer_step: 0.18
[2025-11-06 18:37:44,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.72 | bwd_microstep: 2.15 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 0.87 | step_microstep: 1.44
[2025-11-06 18:37:44,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.01 | bwd: 2.90 | bwd_inner: 1.86 | bwd_allreduce: 0.91 | step: 1.52
62%|██████▏ | 2163/3507 [52:58<25:51, 1.15s/it] {'loss': 0.3553, 'learning_rate': 6.76611356250487e-06, 'epoch': 0.62}
tensor([[-4.5000, -1.1484, 2.1094, -0.7578, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7500, 0.3809, 4.0312, -1.1484, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.1875, -5.4688, -0.2295, 0.9805, -4.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:44,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.16 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.5625, -3.0469, 0.5742, 1.5391, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.9062, -4.9688, -0.5742, 2.2500, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.8125, -5.0000, -0.9766, 1.7969, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2500, -3.7656, -0.1436, 2.8281, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4844, -0.0898, 2.9062, -0.8281, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:37:47,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.28 | optimizer_step: 0.28
[2025-11-06 18:37:47,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.68 | bwd_microstep: 2220.30 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 2219.17 | step_microstep: 2.77
[2025-11-06 18:37:47,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.87 | bwd: 2221.11 | bwd_inner: 1.72 | bwd_allreduce: 2219.23 | step: 2.86
62%|██████▏ | 2164/3507 [53:01<35:51, 1.60s/it] {'loss': 0.2447, 'learning_rate': 6.757374034738899e-06, 'epoch': 0.62}
tensor([[-5.1250, -1.3516, 3.6719, 0.3672, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:47,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.00 | bwd_microstep: 1.13 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-6.4062, -3.9375, 1.9219, 2.0469, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5000, -4.6250, 0.2344, 3.3906, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-3.0781, -2.3281, 1.0000, 3.3906, -1.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([3], device='cuda:2')
tensor([[-3.2188, 0.0352, 2.9219, 0.0679, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-0.9219, 1.8984, 1.7734, -1.4453, -1.4766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-4.5312, -2.0938, 1.7656, 0.7422, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.1875, -0.5547, 3.7031, -1.9922, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:37:47,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 18:37:47,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.09 | bwd_microstep: 88.84 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 87.93 | step_microstep: 1.66
[2025-11-06 18:37:47,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 266.09 | bwd: 89.97 | bwd_inner: 1.82 | bwd_allreduce: 87.99 | step: 1.77
62%|██████▏ | 2165/3507 [53:01<27:41, 1.24s/it] {'loss': 0.2476, 'learning_rate': 6.74863727380822e-06, 'epoch': 0.62}
tensor([[-3.7344, -0.0312, 2.4219, -1.5078, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.8984, -2.7656, -1.6094, 2.4844, 0.1611]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1250, -0.1128, 2.3750, -2.3125, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:48,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.56 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.9375, -2.4688, 1.2031, -0.0554, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2031, 1.4141, 3.6875, -2.6094, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5938, -5.8125, -0.9297, 2.5156, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1562, -1.5000, 2.6406, -0.8633, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.7227, 2.2031, 2.0469, -1.7422, -1.4922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:37:50,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:37:50,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.74 | bwd_microstep: 2638.23 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 2637.07 | step_microstep: 2.57 [2025-11-06 18:37:50,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.34 | bwd: 2638.97 | bwd_inner: 1.71 | bwd_allreduce: 2637.11 | step: 2.65 62%|██████▏ | 2166/3507 [53:04<40:01, 1.79s/it] {'loss': 0.2, 'learning_rate': 6.739903287167646e-06, 'epoch': 0.62} 62%|██████▏ | 2166/3507 [53:04<40:01, 1.79s/it]tensor([[-4.5938, -0.8164, 3.7969, 0.0535, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.1250, -2.6875, 1.7969, -0.7383, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9688, -4.1562, 0.6641, 1.8750, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:37:51,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.31 | bwd_microstep: 1.45 | bwd_inner_microstep: 1.29 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-6.4062, -5.6875, -0.9648, 2.3438, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5938, -1.8594, 2.0000, 0.1650, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5625, -4.0938, 0.0127, 
3.4844, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.9062, -2.2969, 2.7812, -0.3301, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4062, -0.7734, 2.0156, 0.3613, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:37:51,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.13 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:37:51,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.73 | bwd_microstep: 51.09 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 49.83 | step_microstep: 3.02
[2025-11-06 18:37:51,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 275.05 | bwd: 52.54 | bwd_inner: 2.50 | bwd_allreduce: 49.88 | step: 3.13
62%|██████▏ | 2167/3507 [53:05<30:30, 1.37s/it] {'loss': 0.3214, 'learning_rate': 6.73117208226963e-06, 'epoch': 0.62}
62%|██████▏ | 2167/3507 [53:05<30:30, 1.37s/it]
tensor([[-5.0625, -4.0625, 0.0258, 2.4688, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-5.9688, -5.1562, -0.8516, 1.9609, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([3], device='cuda:2')
[2025-11-06 18:37:51,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.46 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.1562, -3.3906, 1.3359, 2.2812, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.4062, -2.6094, 2.9219, -0.2402, -5.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.9688, -5.1250, 1.3438, 0.9102, -5.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7812, -2.2344, 2.5312, 1.8750, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.4688, -2.7344, 1.3594, 1.9609, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7812, -3.7969, 1.5312, 2.3125, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:37:53,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 18:37:53,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.70 | bwd_microstep: 1362.14 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1361.00 | step_microstep: 1.59
[2025-11-06 18:37:53,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.20 | bwd: 1363.03 | bwd_inner: 1.83 | bwd_allreduce: 1361.04 | step: 1.68
62%|██████▏ | 2168/3507 [53:06<33:20, 1.49s/it] {'loss': 0.4587, 'learning_rate': 6.722443666564244e-06, 'epoch': 0.62}
62%|██████▏ | 2168/3507 [53:06<33:20, 1.49s/it]
tensor([[-4.4688, -4.1562, -0.2168, 3.2969, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:53,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.82 | bwd_microstep: 1.17 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-4.9688, -2.5469, 1.9062, 1.3438, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8125, -2.7500, 1.3438, 1.4766, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.8125, -4.2188, 1.3203, 0.9531, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.2812, -4.9062, -0.1475, 1.4922, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1250, -4.5000, -1.4141, 3.0312, -1.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.1250, -4.1562, 1.7812, 0.5391, -5.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.7812, -1.2344, 1.9062, 0.0127, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:37:53,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.17 | optimizer_step: 0.22
[2025-11-06 18:37:53,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.44 | bwd_microstep: 194.14 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 193.02 | step_microstep: 2.30
[2025-11-06 18:37:53,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 390.30 | bwd: 195.30 | bwd_inner: 2.07 | bwd_allreduce: 193.07 | step: 2.41
62%|██████▏ | 2169/3507 [53:07<27:29, 1.23s/it] {'loss': 0.3061, 'learning_rate': 6.7137180474991825e-06, 'epoch': 0.62}
62%|██████▏ | 2169/3507 [53:07<27:29, 1.23s/it]
tensor([[-5.4062, -4.6562, -1.2578, 1.2969, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.0781, -3.8750, -2.3750, 1.9922, -0.6484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:53,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.10 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-7.0938, -4.6562, 1.3750, 1.2969, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8906, -3.7969, -0.1660, 3.7812, -1.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-9.3750, -9.0000, -5.1562, -1.1797, -5.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7188, -1.8906, 2.5312, -1.2500, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.6875, -1.7500, 2.2812, 0.4453, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4688, -2.7656, 1.2656, 2.0781, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:37:55,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.26 | optimizer_step: 0.23
[2025-11-06 18:37:55,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.28 | bwd_microstep: 1095.57 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 1094.33 | step_microstep: 2.12
[2025-11-06 18:37:55,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.41 | bwd: 1096.45 | bwd_inner: 1.94 | bwd_allreduce: 1094.37 | step: 2.20
62%|██████▏ | 2170/3507 [53:09<29:19, 1.32s/it] {'loss': 0.4128, 'learning_rate': 6.704995232519755e-06, 'epoch': 0.62}
62%|██████▏ | 2170/3507 [53:09<29:19, 1.32s/it]
tensor([[-6.3438, -5.8125, -1.3672, 1.9531, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:55,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.76 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.0312, -2.8281, 2.7812, 1.1094, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[0.3613, 2.2344, 5.1562, 4.9062, 0.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.6562, -4.9688, -0.4004, 2.5625, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.1875, -0.0253, 2.1562, -0.7969, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9062, -2.1094, 2.0469, 0.7969, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2500, 0.1377, 3.9062, -1.5938, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1562, -0.7188, 2.8906, -0.0728, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:37:56,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.18 | optimizer_step: 0.21
[2025-11-06 18:37:56,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.59 | bwd_microstep: 1291.32 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1290.15 | step_microstep: 2.78
[2025-11-06 18:37:56,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 345.38 | bwd: 1292.33 | bwd_inner: 2.02 | bwd_allreduce: 1290.19 | step: 2.86
62%|██████▏ | 2171/3507 [53:10<31:41, 1.42s/it] {'loss': 0.5744, 'learning_rate': 6.69627522906888e-06, 'epoch': 0.62}
62%|██████▏ | 2171/3507 [53:10<31:41, 1.42s/it]
tensor([[-4.2188, -1.2422, 2.3281, 0.1211, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.8438, -3.6875, -2.1562, 2.3594, -0.4023]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.3281, 0.1050, 2.8594, -0.4766, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:37:57,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.54 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.2500, -1.0156, 1.6875, 0.8203, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8906, -0.0347, 2.9062, -1.5859, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.3438, -4.0312, 1.5234, 1.8203, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2188, -1.3984, 2.3125, 0.7539, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.5312, -5.7812, -0.7500, 2.6875, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:37:57,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:37:57,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.17 | bwd_microstep: 113.17 | bwd_inner_microstep: 1.66 | bwd_allreduce_microstep: 111.44 | step_microstep: 1.82
[2025-11-06 18:37:57,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 305.74 | bwd: 113.95 | bwd_inner: 2.36 | bwd_allreduce: 111.47 | step: 1.89
62%|██████▏ | 2172/3507 [53:11<25:12, 1.13s/it] {'loss': 0.5126, 'learning_rate': 6.687558044587072e-06, 'epoch': 0.62}
62%|██████▏ | 2172/3507 [53:11<25:12, 1.13s/it]
tensor([[-3.1406, 0.9180, 2.5156, -2.9062, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1250, -2.7656, 1.7578, 1.7266, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:57,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.68 | bwd_microstep: 1.64 | bwd_inner_microstep: 1.53 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.5312, -3.3906, 1.0391, 1.1016, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.4062, -4.1250, 1.6250, 2.1562, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7344, -0.2100, 2.4531, -1.2891, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.6562, -4.7188, 0.2402, 3.3438, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5000, -1.9766, 2.1719, 1.2891, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.8125, -3.0469, 1.8906, 0.6094, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:37:59,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:37:59,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.79 | bwd_microstep: 701.08 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 699.95 | step_microstep: 2.08
[2025-11-06 18:37:59,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.49 | bwd: 702.72 | bwd_inner: 2.58 | bwd_allreduce: 699.98 | step: 2.17
62%|██████▏ | 2173/3507 [53:13<30:52, 1.39s/it] {'loss': 0.35, 'learning_rate': 6.678843686512437e-06, 'epoch': 0.62}
62%|██████▏ | 2173/3507 [53:13<30:52, 1.39s/it]
tensor([[-5.7812, -4.2812, 0.8359, 2.6250, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:37:59,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.22 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-3.1562, -0.9805, 1.5000, 0.1426, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4375, -1.8672, 3.0469, 0.2471, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1562, -0.9844, 2.6094, -0.4551, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.9375, -5.9062, -1.1250, 1.4609, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.1562, -5.2500, -1.7812, 2.3906, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.4688, -5.1562, -0.2412, 1.6797, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
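The `{'loss': ..., 'learning_rate': ..., 'epoch': ...}` dicts interleaved with the progress bars above are the per-step metric records that the Hugging Face `Trainer` prints. If the console output has been captured to a file, they can be recovered mechanically; the sketch below is illustrative post-processing, not part of the training script, and the regex assumes the exact `{'loss': ...}` layout shown in this log:

```python
import ast
import re

# Matches the metric dict printed after each optimizer step,
# e.g. {'loss': 0.3553, 'learning_rate': 6.76611356250487e-06, 'epoch': 0.62}
LOG_DICT = re.compile(r"\{'loss':[^}]*\}")

def extract_metrics(text: str) -> list[dict]:
    """Return the loss/learning_rate/epoch dicts found in raw console output."""
    return [ast.literal_eval(m.group(0)) for m in LOG_DICT.finditer(text)]

sample = ("62%| 2163/3507 [52:58<25:51, 1.15s/it] "
          "{'loss': 0.3553, 'learning_rate': 6.76611356250487e-06, 'epoch': 0.62}")
print(extract_metrics(sample)[0]["loss"])  # -> 0.3553
```

Feeding the whole log through `extract_metrics` gives a loss curve (here hovering between roughly 0.1 and 1.2 around step 2163-2183) without rerunning training.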
tensor([[-3.9688, -4.3750, -1.5234, 2.8750, -1.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:37:59,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 18:37:59,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.33 | bwd_microstep: 345.66 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 344.39 | step_microstep: 1.52
[2025-11-06 18:37:59,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 319.57 | bwd: 346.62 | bwd_inner: 2.07 | bwd_allreduce: 344.42 | step: 1.58
62%|██████▏ | 2174/3507 [53:13<26:14, 1.18s/it] {'loss': 0.11, 'learning_rate': 6.670132162280685e-06, 'epoch': 0.62}
62%|██████▏ | 2174/3507 [53:13<26:14, 1.18s/it]
tensor([[-7.0312, -5.7500, -0.4961, 2.0312, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5312, -4.5625, -0.6328, 1.4141, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.2500, -2.0469, 3.2344, -0.7812, -5.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:38:00,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.51 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.4375, -3.1094, 0.9609, 0.1387, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5938, -4.9062, -0.0261, 3.5469, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.3594, 0.3262, 3.1250, -0.9180, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.3594, 1.2578, 2.6562, -1.8672, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8125, -3.5781, 0.4648, 2.2812, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:38:02,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:38:02,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.22 | bwd_microstep: 1249.32 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1248.20 | step_microstep: 2.19
[2025-11-06 18:38:02,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 510.76 | bwd: 1250.21 | bwd_inner: 1.83 | bwd_allreduce: 1248.24 | step: 2.27
62%|██████▏ | 2175/3507 [53:16<36:55, 1.66s/it] {'loss': 0.1399, 'learning_rate': 6.66142347932509e-06, 'epoch': 0.62}
62%|██████▏ | 2175/3507 [53:16<36:55, 1.66s/it]
tensor([[-3.0625, -0.4336, 1.4062, -0.5664, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.9062, -4.5000, -2.2188, 2.1094, -1.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-4.4375, -2.3125, 1.7266, 1.5938, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:38:02,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.06 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.2812, -0.7891, 3.9375, 1.0156, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5938, -3.1562, 1.0391, 2.4688, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7500, -1.5391, 2.4375, 0.0116, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.1875, -3.9062, 1.4453, -0.4160, -5.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.6562, -4.8125, -1.6953, 2.5000, -1.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:38:03,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:38:03,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.95 | bwd_microstep: 19.74 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 18.75 | step_microstep: 1.85
[2025-11-06 18:38:03,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.03 | bwd: 20.64 | bwd_inner: 1.74 | bwd_allreduce: 18.79 | step: 1.92
62%|██████▏ | 2176/3507 [53:16<28:23, 1.28s/it] {'loss': 1.1617, 'learning_rate': 6.652717645076516e-06, 'epoch': 0.62}
62%|██████▏ | 2176/3507 [53:16<28:23, 1.28s/it]
tensor([[-5.9375, -5.7188, -1.6953, 2.1250, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:38:03,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.83 | bwd_microstep: 1.19 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.7500, -5.1562, -1.3984, 1.5781, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4688, -1.1875, 3.9844, -0.3691, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.2656, 1.9609, 3.0469, -1.2969, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2500, -1.2031, 2.2031, -0.2217, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.9375, -4.8125, 0.4746, 0.9883, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0312, -3.1094, 0.2354, 1.8438, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.8672, -2.3750, -1.2109, 2.1094, -0.0178]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:38:05,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.28 | optimizer_step: 0.44
[2025-11-06 18:38:05,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.31 | bwd_microstep: 2323.08 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 2321.81 | step_microstep: 2.94
[2025-11-06 18:38:05,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.16 | bwd: 2324.28 | bwd_inner: 2.23 | bwd_allreduce: 2321.89 | step: 3.03
62%|██████▏ | 2177/3507 [53:19<37:55, 1.71s/it] {'loss': 0.4767, 'learning_rate': 6.6440146669633855e-06, 'epoch': 0.62}
62%|██████▏ | 2177/3507 [53:19<37:55, 1.71s/it]
[h264 @ 0xd0e7300] mmco: unref short failure
[h264 @ 0xd0e7300] mmco: unref short failure
tensor([[-5.6875, -1.9922, 2.6250, -0.4727, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:38:05,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 65.01 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.9062, -4.6875, -0.2441, 1.3750, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0000, -4.2500, -0.2500, 2.5469, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.5312, 0.0170, 1.6172, -0.3359, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.7734, -1.1797, 0.6172, 4.5938, 1.0703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.5938, 1.5859, 3.1719, -2.3438, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.0938, -4.8750, 1.1328, 1.8047, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9688, -3.4531, 0.8047,
1.9922, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:38:06,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.20 | optimizer_step: 0.24
[2025-11-06 18:38:06,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.19 | bwd_microstep: 140.16 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 138.80 | step_microstep: 2.61
[2025-11-06 18:38:06,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 204.21 | bwd: 141.05 | bwd_inner: 2.05 | bwd_allreduce: 138.85 | step: 2.69
62%|██████▏ | 2178/3507 [53:20<29:00, 1.31s/it] {'loss': 0.196, 'learning_rate': 6.635314552411687e-06, 'epoch': 0.62}
62%|██████▏ | 2178/3507 [53:20<29:00, 1.31s/it]
tensor([[-5.3750, -2.3594, 1.0781, -0.7812, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7969, -0.7188, 3.1562, 0.6562, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.7500, -2.7031, 0.9961, 1.0156, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:38:06,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.18 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.6250, -4.1562, -0.5195, 2.7500, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-7.2188, -5.1250, 0.1299, 0.6992, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.6250, -4.9375, 0.6484, 2.3906, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5312, 0.3203, 2.3906, -2.0625, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.5625, -6.1875, -1.9766, 2.0156, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:38:08,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.22 | optimizer_step: 0.23
[2025-11-06 18:38:08,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.96 | bwd_microstep: 1810.90 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1809.80 | step_microstep: 3.76
[2025-11-06 18:38:08,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.20 | bwd: 1811.79 | bwd_inner: 1.81 | bwd_allreduce: 1809.85 | step: 3.87
62%|██████▏ | 2179/3507 [53:22<34:56, 1.58s/it] {'loss': 0.6354, 'learning_rate': 6.626617308844968e-06, 'epoch': 0.62}
62%|██████▏ | 2179/3507 [53:22<34:56, 1.58s/it]
tensor([[-5.4688, -1.3672, 2.7812, -1.5703, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.0469, -0.4043, 0.4160, -1.7031, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.2188, -4.1875, 0.2393, 2.6875, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:38:08,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.81 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.9688, -1.6250, 2.0469, -0.9570, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.9375, -5.4062, -1.0078, 2.6094, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3125, -2.8594, 0.8867, 2.2188, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.9219, -3.5625, -1.4375, 2.9844, -0.4922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0312, -0.6328, 3.9062, -1.3750, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:38:08,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.31 | optimizer_step: 0.27
[2025-11-06 18:38:08,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 254.39 | bwd_microstep: 6.76 | bwd_inner_microstep: 5.21 | bwd_allreduce_microstep: 1.39 | step_microstep: 7.79
[2025-11-06 18:38:08,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 435.25 | bwd: 7.44 | bwd_inner: 5.80 | bwd_allreduce: 1.43 | step: 7.90
62%|██████▏ | 2180/3507 [53:22<27:48, 1.26s/it] {'loss': 0.2303, 'learning_rate': 6.617922943684327e-06, 'epoch': 0.62}
62%|██████▏ | 2180/3507 [53:22<27:48, 1.26s/it]
tensor([[-5.8750, -4.5625, 0.0693, 1.9531, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6875, -0.6562, 2.9219, 0.8516, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1250, -1.2500, 2.1719, 0.2051, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:38:09,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.13 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-8.0000, -5.2500, 0.9297, 0.5820, -5.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.9688, -5.0312, 1.1016, 0.2070, -6.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-7.1250, -7.5312, -4.0312, 1.0078, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-5.9688, -4.5000, 0.3984, 2.2969, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.7812, -4.4062, 0.8906, 0.7969, -4.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:38:10,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.14 | optimizer_gradients: 0.20 | optimizer_step: 0.18
[2025-11-06 18:38:10,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 302.73 | bwd_microstep: 742.10 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 741.27 | step_microstep: 3.07
[2025-11-06 18:38:10,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 496.89 | bwd: 742.83 | bwd_inner: 1.37 | bwd_allreduce: 741.31 | step: 3.17
62%|██████▏ | 2181/3507 [53:24<28:04, 1.27s/it] {'loss': 0.8564, 'learning_rate': 6.609231464348402e-06, 'epoch': 0.62}
62%|██████▏ | 2181/3507 [53:24<28:04, 1.27s/it]
tensor([[-6.0625, -3.2969, 1.9375, 1.2422, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:38:10,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.55 | bwd_microstep: 1.68 | bwd_inner_microstep: 1.54 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-3.7500, -4.1562, -1.0312, 3.6406, -1.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7188, 0.2754, 3.6562, -0.9062, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6406, 0.4922, 2.0469, -0.7031, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0938, -5.0625, -1.1328, 3.2656, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.6562, -2.9219, 1.7422, 0.5938, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.6406, 1.2031, 2.7969, -2.1562, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.5625, -2.6094, -0.3223, 2.6406, -0.7773]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:38:10,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.04 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:38:10,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.62 | bwd_microstep: 183.11 | bwd_inner_microstep: 1.83 | bwd_allreduce_microstep: 181.19 | step_microstep: 3.07
[2025-11-06 18:38:10,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 305.18 | bwd: 184.78 | bwd_inner: 3.39 | bwd_allreduce: 181.24 | step: 3.16
62%|██████▏ | 2182/3507 [53:24<23:08, 1.05s/it] {'loss': 0.2515, 'learning_rate': 6.600542878253378e-06, 'epoch': 0.62}
62%|██████▏ | 2182/3507 [53:24<23:08, 1.05s/it]
tensor([[-4.9062, -2.5469, 1.4375, 0.7070, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0000, -3.9062, -0.0273, 1.8984, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:38:11,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.71 | bwd_microstep: 5.43 | bwd_inner_microstep: 5.22 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12
tensor([[-5.2500, -4.2188, 0.2139, 2.8125, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.3125, -1.5625, 1.2500, -0.7891, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.7812, -4.1875, 2.1562, 2.1406, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3438, -3.9844, -0.2969, 3.0625, -1.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.2812, -3.4375, -2.7188, 1.8438, 0.0062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7344, -0.5547, 2.8750, 0.1719, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:38:13,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.24 | optimizer_step: 0.25
[2025-11-06 18:38:13,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.41 | bwd_microstep: 1614.21 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 1613.26 | step_microstep: 300.19
[2025-11-06 18:38:13,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.13 | bwd: 1619.62 | bwd_inner: 6.07 | bwd_allreduce: 1613.33 | step: 300.31
62%|██████▏ | 2183/3507 [53:26<31:51, 1.44s/it] {'loss': 0.5646, 'learning_rate': 6.591857192812955e-06, 'epoch': 0.62}
62%|██████▏ | 2183/3507 [53:26<31:51, 1.44s/it]
tensor([[-3.6250, -3.9219, -0.4082, 4.3125, -0.9648]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.3125, -4.2812, 1.3828, -0.2295, -5.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5625, -2.2188, 1.3906, 0.7305, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:38:13,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.52 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-4.1875, -4.5938, -1.4766, 3.0469, -1.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.9688, -6.6562, -1.4922, 3.0469, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5000, -1.2344, 2.2344, -0.6289, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5625, -3.6094, 0.9219, 1.5781, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5312, -2.0781, 2.1719, -0.6484, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:38:14,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) |
optimizer_allgather: 0.26 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 18:38:14,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.16 | bwd_microstep: 1.77 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.70 | step_microstep: 7.32 [2025-11-06 18:38:14,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.69 | bwd: 2.78 | bwd_inner: 1.86 | bwd_allreduce: 0.76 | step: 7.42 62%|██████▏ | 2184/3507 [53:28<32:50, 1.49s/it] {'loss': 0.4407, 'learning_rate': 6.583174415438372e-06, 'epoch': 0.62} 62%|██████▏ | 2184/3507 [53:28<32:50, 1.49s/it]tensor([[-5.5312, -3.3750, 0.5859, 0.4707, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4688, -3.8750, -0.3652, 2.4375, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8906, 0.0942, 2.8906, -0.1318, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:15,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.02 | bwd_microstep: 1.51 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 tensor([[-4.2500, -3.7812, -0.8125, 1.8984, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[ 0.2344, 3.4062, 5.8750, 2.1250, -0.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2812, -1.5156, 3.3594, -0.2695, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0625, -2.8750, 2.3594, 0.4746, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1250, -3.2812, 2.1250, 1.0938, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:38:16,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | 
optimizer_gradients: 0.20 | optimizer_step: 0.29 [2025-11-06 18:38:16,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 257.21 | bwd_microstep: 1446.29 | bwd_inner_microstep: 4.96 | bwd_allreduce_microstep: 1441.20 | step_microstep: 2.38 [2025-11-06 18:38:16,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 464.27 | bwd: 1447.77 | bwd_inner: 6.27 | bwd_allreduce: 1441.27 | step: 2.53 62%|██████▏ | 2185/3507 [53:30<36:00, 1.63s/it] {'loss': 0.3187, 'learning_rate': 6.574494553538379e-06, 'epoch': 0.62} 62%|██████▏ | 2185/3507 [53:30<36:00, 1.63s/it]tensor([[-6.7500, -2.5469, 2.5000, -1.5547, -6.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.1953, -1.6484, -0.3359, 3.0938, 0.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-3.9062, -0.5078, 3.3594, 0.2334, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:38:16,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.58 | bwd_microstep: 0.63 | bwd_inner_microstep: 0.53 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.0938, -3.5156, 1.6797, 1.0781, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5938, -1.9766, 3.1250, -2.1094, -6.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.9062e+00, -6.7500e+00, -2.5469e+00, -3.2654e-03, -5.0938e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3438, -2.7344, 2.3594, 1.5078, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6250, -1.6719, 1.0781, -1.2109, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:18,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | 
optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:38:18,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.40 | bwd_microstep: 2.48 | bwd_inner_microstep: 1.54 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.59 [2025-11-06 18:38:18,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.99 | bwd: 3.11 | bwd_inner: 2.08 | bwd_allreduce: 0.88 | step: 2.68 62%|██████▏ | 2186/3507 [53:32<37:53, 1.72s/it] {'loss': 0.9719, 'learning_rate': 6.565817614519245e-06, 'epoch': 0.62} 62%|██████▏ | 2186/3507 [53:32<37:53, 1.72s/it]tensor([[-3.9688, -1.1719, 2.7656, 1.1797, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.3125, -5.0625, 0.0957, 0.1768, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9375, -4.9688, 0.1016, 3.2031, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0312, -3.3750, 0.7930, 3.7812, -1.8047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:38:18,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.86 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.1562, -2.0781, 1.5938, -0.9258, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2188, -1.6172, 2.8281, -2.6562, -6.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.0781, -2.0469, 0.8633, 4.1875, -0.2139]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -0.6328, 2.9219, -0.2617, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:38:19,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.16 | optimizer_step: 
0.19 [2025-11-06 18:38:19,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.89 | bwd_microstep: 501.20 | bwd_inner_microstep: 1.30 | bwd_allreduce_microstep: 499.81 | step_microstep: 1.83 [2025-11-06 18:38:19,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.78 | bwd: 502.12 | bwd_inner: 2.11 | bwd_allreduce: 499.86 | step: 1.91 62%|██████▏ | 2187/3507 [53:33<32:26, 1.47s/it] {'loss': 0.3585, 'learning_rate': 6.557143605784743e-06, 'epoch': 0.62} 62%|██████▏ | 2187/3507 [53:33<32:26, 1.47s/it]tensor([[-2.3281, 1.9141, 3.0000, -2.7969, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:19,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.56 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.7344, -2.4531, 0.6641, 1.4219, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1094, -2.1719, 0.5508, 1.8984, -1.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2500, -4.6562, 0.6797, 2.4375, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -1.3594, 2.7812, -0.4238, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4688, 1.6016, 3.2500, -1.9219, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3125, -2.9062, 1.2891, 2.5938, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6250, 0.2080, 2.6094, -1.8516, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:22,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:38:22,010] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.47 | bwd_microstep: 1.77 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 2.07 [2025-11-06 18:38:22,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 304.03 | bwd: 2.63 | bwd_inner: 1.74 | bwd_allreduce: 0.77 | step: 2.15 62%|██████▏ | 2188/3507 [53:35<38:50, 1.77s/it] {'loss': 0.3806, 'learning_rate': 6.5484725347361374e-06, 'epoch': 0.62} 62%|██████▏ | 2188/3507 [53:35<38:50, 1.77s/it]tensor([[-2.0938, 0.3398, 1.8359, 0.0052, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5000, 1.0312, 4.0312, -1.8359, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-8.8750, -5.0938, 1.3750, -0.9414, -7.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2812, -3.3281, 1.8906, 0.5859, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:22,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.15 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.2500, -2.1094, 1.5391, -1.1875, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5938, -2.3125, 1.2344, 0.6211, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6562, -2.7344, 2.3906, 1.1875, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1719, 0.8984, 1.7188, -1.5703, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:22,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:38:22,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 174.87 | bwd_microstep: 1.42 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.64 | step_microstep: 4.05 [2025-11-06 18:38:22,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 479.05 | bwd: 2.31 | bwd_inner: 1.50 | bwd_allreduce: 0.67 | step: 4.14 62%|██████▏ | 2189/3507 [53:36<30:40, 1.40s/it] {'loss': 0.3976, 'learning_rate': 6.5398044087721946e-06, 'epoch': 0.62} 62%|██████▏ | 2189/3507 [53:36<30:40, 1.40s/it]tensor([[-5.2188, -2.2656, 1.8984, -0.0334, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:22,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.93 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.3125, -4.2188, -0.8008, 2.9688, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.0938, -3.9844, 2.2500, 1.0625, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1875, -3.7969, 0.8164, 0.2715, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.4062, 1.0625, 2.1875, -2.0312, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9219, -2.9375, 0.5352, 2.4062, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -4.5938, -0.4355, 2.8281, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0625, -3.7500, 0.5234, 2.3750, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:38:23,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:38:23,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.36 | bwd_microstep: 
1051.88 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1050.70 | step_microstep: 1.75 [2025-11-06 18:38:23,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.31 | bwd: 1052.73 | bwd_inner: 1.83 | bwd_allreduce: 1050.75 | step: 1.85 62%|██████▏ | 2190/3507 [53:37<30:49, 1.40s/it] {'loss': 0.3451, 'learning_rate': 6.5311392352891704e-06, 'epoch': 0.62} 62%|██████▏ | 2190/3507 [53:37<30:49, 1.40s/it]tensor([[-6.4688, -4.4688, 0.6641, 1.6172, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:38:24,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 64.55 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.5312, -0.6211, 3.3594, -0.8828, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6875, -0.6406, 2.9062, -1.4141, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5000, -3.3906, 0.1553, 4.0000, -1.1641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1250, -3.2031, 0.9102, 0.9688, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.4844, 0.4629, 3.2031, -1.4609, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -2.7969, 1.3828, 2.7344, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9375, -2.7656, 2.5625, 0.7812, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:38:24,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:38:24,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.21 | bwd_microstep: 109.94 | bwd_inner_microstep: 0.96 
| bwd_allreduce_microstep: 108.89 | step_microstep: 1.86 [2025-11-06 18:38:24,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 210.78 | bwd: 110.98 | bwd_inner: 1.92 | bwd_allreduce: 108.93 | step: 1.94 62%|██████▏ | 2191/3507 [53:38<23:50, 1.09s/it] {'loss': 0.5907, 'learning_rate': 6.522477021680791e-06, 'epoch': 0.62} 62%|██████▏ | 2191/3507 [53:38<23:50, 1.09s/it]tensor([[-5.0938, -4.8750, -0.5547, 3.7188, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9219, -1.9766, 1.6875, 1.7188, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:24,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.97 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.4375, -4.7812, -1.5312, 3.2344, -1.6328]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8438, -5.5938, -1.5625, 2.2969, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.6484, 1.8203, 2.1250, -2.2188, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-7.1250, -5.2500, -0.4824, 0.4121, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2500, -3.0938, 1.3359, 1.2422, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9375, 0.5898, 3.3438, -2.5469, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 18:38:26,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 18:38:26,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.32 | bwd_microstep: 2290.59 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 2289.59 
| step_microstep: 2.58 [2025-11-06 18:38:26,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.31 | bwd: 2291.39 | bwd_inner: 1.61 | bwd_allreduce: 2289.63 | step: 2.65 63%|██████▎ | 2192/3507 [53:40<34:05, 1.56s/it] {'loss': 0.8097, 'learning_rate': 6.513817775338268e-06, 'epoch': 0.63} 63%|██████▎ | 2192/3507 [53:40<34:05, 1.56s/it]tensor([[-3.9688, 0.1934, 2.8750, -2.1406, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-9.6250, -5.9688, -0.3535, -2.7188, -7.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:27,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.02 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.0000, -2.6875, 2.5000, 0.4023, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[2.7812, 4.6250, 6.5938, 5.2500, 2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.3750, -3.2656, 0.7617, 0.8555, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1250, -2.1719, 2.6250, 0.7812, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2812, -4.9688, -0.6328, 0.9375, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2344, 1.7578, 3.0312, -2.2344, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:38:27,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:38:27,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.34 | bwd_microstep: 450.67 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 449.64 | step_microstep: 1.63 [2025-11-06 
18:38:27,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.38 | bwd: 451.60 | bwd_inner: 1.80 | bwd_allreduce: 449.68 | step: 1.71 63%|██████▎ | 2193/3507 [53:41<29:09, 1.33s/it] {'loss': 0.9499, 'learning_rate': 6.505161503650277e-06, 'epoch': 0.63} 63%|██████▎ | 2193/3507 [53:41<29:09, 1.33s/it]tensor([[-2.5781, 0.8633, 2.3750, -1.6875, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5000, -3.0625, -0.3652, 2.3438, -1.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:38:27,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.29 | bwd_microstep: 1.12 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-2.9688, -1.6719, 2.7031, 4.2500, -1.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0938, -4.1562, 0.8281, 1.7109, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4375, -4.8125, -0.3750, 2.9688, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0938, 0.1260, 2.9531, -2.6406, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.2500, -3.5000, 0.3262, 2.8906, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3125, -3.0469, 2.2812, 2.2969, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:38:30,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.25 | optimizer_step: 0.29 [2025-11-06 18:38:30,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.65 | bwd_microstep: 2412.91 | bwd_inner_microstep: 2.05 | bwd_allreduce_microstep: 2410.59 | step_microstep: 2.66 [2025-11-06 18:38:30,547] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.97 | bwd: 2414.03 | bwd_inner: 3.17 | bwd_allreduce: 2410.65 | step: 2.76 63%|██████▎ | 2194/3507 [53:44<38:38, 1.77s/it] {'loss': 0.6826, 'learning_rate': 6.496508214002948e-06, 'epoch': 0.63} 63%|██████▎ | 2194/3507 [53:44<38:38, 1.77s/it]tensor([[-3.5625, -0.9336, 2.6562, 1.1328, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7500, -4.5625, 0.3281, 2.5781, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7969, -2.3750, 0.8359, 1.6953, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5312, -2.2344, 1.5000, 1.0625, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.3125, -4.8125, 1.0703, 3.3438, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:30,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 269.22 | bwd_microstep: 5.42 | bwd_inner_microstep: 5.31 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.3438, -3.0312, 2.6562, 0.5273, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3125, -3.9062, -0.4824, 2.5781, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3594, -0.4551, 2.9531, 0.9453, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:31,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:38:31,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.83 | bwd_microstep: 2.14 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.07 [2025-11-06 18:38:31,084] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 471.07 | bwd: 7.56 | bwd_inner: 6.53 | bwd_allreduce: 0.88 | step: 2.15 63%|██████▎ | 2195/3507 [53:44<30:35, 1.40s/it] {'loss': 0.7373, 'learning_rate': 6.487857913779876e-06, 'epoch': 0.63} 63%|██████▎ | 2195/3507 [53:44<30:35, 1.40s/it]tensor([[-3.1406, -4.0625, -2.6250, 1.9219, -0.6836]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9531, 1.8125, 3.4062, -1.5703, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.4062, -1.2891, 2.0781, -0.9258, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:31,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.71 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.4062, -3.2969, 0.8320, 2.8281, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.5391, -2.2344, -0.4102, 3.8750, 0.5078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-3.3750, 0.5117, 1.4453, -3.5781, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7812, -2.8750, 2.5469, 1.0938, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3906, 0.2090, 2.6875, -1.4453, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:38:34,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:38:34,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.68 | bwd_microstep: 2674.45 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 2673.67 | step_microstep: 2.15 [2025-11-06 18:38:34,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.42 | bwd: 2675.30 | 
bwd_inner: 1.43 | bwd_allreduce: 2673.72 | step: 2.24 63%|██████▎ | 2196/3507 [53:47<41:31, 1.90s/it] {'loss': 0.7724, 'learning_rate': 6.479210610362103e-06, 'epoch': 0.63} 63%|██████▎ | 2196/3507 [53:47<41:31, 1.90s/it]tensor([[-5.0000, -2.8906, 0.3887, -0.3066, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3125, -1.7656, 2.8750, -0.2539, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8438, -2.6094, 1.7109, 1.2734, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -0.8125, 3.7344, -0.7852, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:34,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.89 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.1562, -2.8906, -0.3984, 2.6250, -1.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -3.2344, 1.2422, 3.0938, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5625, -1.5000, 2.1719, -0.2158, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5938, -1.6797, 3.2344, -0.2334, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:38:34,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:38:34,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.89 | bwd_microstep: 1.72 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.16 [2025-11-06 18:38:34,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 459.81 | bwd: 2.48 | bwd_inner: 1.49 | bwd_allreduce: 0.84 | 
step: 2.24
[Per-microbatch debug prints (paired logit/label tensors on cuda:0-3, torch.bfloat16, grad_fn names lost in capture) and the DeepSpeed [Rank 0] time (ms) breakdown lines (fwd/bwd/allreduce/optimizer microsteps) are elided here; the per-step metrics lines are kept below.]
63%|██████▎ | 2197/3507 [53:48<32:20, 1.48s/it] {'loss': 0.2303, 'learning_rate': 6.470566311128113e-06, 'epoch': 0.63}
63%|██████▎ | 2198/3507 [53:50<34:02, 1.56s/it] {'loss': 0.3151, 'learning_rate': 6.46192502345383e-06, 'epoch': 0.63}
63%|██████▎ | 2199/3507 [53:51<31:00, 1.42s/it] {'loss': 0.3191, 'learning_rate': 6.4532867547126e-06, 'epoch': 0.63}
63%|██████▎ | 2200/3507 [53:54<39:35, 1.82s/it] {'loss': 0.7276, 'learning_rate': 6.444651512275198e-06, 'epoch': 0.63}
63%|██████▎ | 2201/3507 [53:54<32:39, 1.50s/it] {'loss': 0.4496, 'learning_rate': 6.43601930350982e-06, 'epoch': 0.63}
63%|██████▎ | 2202/3507 [53:56<34:30, 1.59s/it] {'loss': 0.4966, 'learning_rate': 6.427390135782068e-06, 'epoch': 0.63}
63%|██████▎ | 2203/3507 [53:57<31:39, 1.46s/it] {'loss': 0.7333, 'learning_rate': 6.418764016454953e-06, 'epoch': 0.63}
63%|██████▎ | 2204/3507 [54:00<38:37, 1.78s/it] {'loss': 0.7823, 'learning_rate': 6.410140952888887e-06, 'epoch': 0.63}
63%|██████▎ | 2205/3507 [54:00<30:34, 1.41s/it] {'loss': 0.5573, 'learning_rate': 6.401520952441662e-06, 'epoch': 0.63}
63%|██████▎ | 2206/3507 [54:02<34:19, 1.58s/it] {'loss': 0.2321, 'learning_rate': 6.3929040224684725e-06, 'epoch': 0.63}
63%|██████▎ | 2207/3507 [54:04<33:28, 1.55s/it] {'loss': 0.7833, 'learning_rate': 6.384290170321881e-06, 'epoch': 0.63}
63%|██████▎ | 2208/3507 [54:05<30:41, 1.42s/it] {'loss': 0.3763, 'learning_rate': 6.375679403351834e-06, 'epoch': 0.63}
63%|██████▎ | 2209/3507 [54:06<30:21, 1.40s/it] {'loss': 0.8738, 'learning_rate': 6.36707172890564e-06, 'epoch': 0.63}
63%|██████▎ | 2210/3507 [54:08<30:50, 1.43s/it] {'loss': 0.1922, 'learning_rate': 6.3584671543279655e-06, 'epoch': 0.63}
63%|██████▎ | 2211/3507 [54:09<27:15, 1.26s/it] {'loss': 0.4001, 'learning_rate': 6.349865686960832e-06, 'epoch': 0.63}
63%|██████▎ | 2212/3507 [54:11<36:58, 1.71s/it] {'loss': 0.4229, 'learning_rate': 6.341267334143621e-06, 'epoch': 0.63}
63%|██████▎ | 2213/3507 [54:12<29:54, 1.39s/it] {'loss': 0.7112, 'learning_rate': 6.332672103213042e-06, 'epoch': 0.63}
63%|██████▎ | 2214/3507 [54:14<34:39, 1.61s/it] {'loss': 0.2288, 'learning_rate': 6.3240800015031544e-06, 'epoch': 0.63}
[h264 @ 0xdbd4100] mmco: unref short failure
[h264 @ 0x980dd40] mmco: unref short failure
63%|██████▎ | 2215/3507 [54:16<34:41, 1.61s/it] {'loss': 0.2572, 'learning_rate': 6.315491036345338e-06, 'epoch': 0.63}
63%|██████▎ | 2216/3507 [54:19<42:38, 1.98s/it] {'loss': 0.522, 'learning_rate': 6.306905215068294e-06, 'epoch': 0.63}
63%|██████▎ | 2217/3507 [54:19<35:17, 1.64s/it] {'loss': 0.2359, 'learning_rate': 6.298322544998048e-06, 'epoch': 0.63}
63%|██████▎ | 2218/3507 [54:23<46:51, 2.18s/it] {'loss': 0.8938, 'learning_rate': 6.2897430334579355e-06, 'epoch': 0.63}
[2025-11-06 18:39:09,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.73 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 |
step_microstep: 0.08 tensor([[-5.8438, -5.4375, -0.4473, 3.7031, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.3125, -5.7812, 0.1455, 2.3281, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.8125, -3.3594, 1.7109, -0.9883, -5.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1250, -1.0000, 3.4688, -1.1562, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4688, 0.1396, 3.0469, -0.8438, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2812, -2.3906, 2.3750, -1.1797, -5.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -1.2656, 2.4219, -0.6094, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:39:10,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:39:10,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.65 | bwd_microstep: 106.58 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 105.51 | step_microstep: 1.54 [2025-11-06 18:39:10,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 464.40 | bwd: 107.51 | bwd_inner: 1.83 | bwd_allreduce: 105.56 | step: 1.62 63%|██████▎ | 2219/3507 [54:24<36:43, 1.71s/it] {'loss': 0.4045, 'learning_rate': 6.281166687768596e-06, 'epoch': 0.63} 63%|██████▎ | 2219/3507 [54:24<36:43, 1.71s/it]tensor([[-4.9062, -4.2812, -0.4902, 2.3906, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5938, 1.6875, 3.9844, -1.8594, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:39:10,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 154.03 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9062, -2.8125, 1.1250, 1.1016, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5312, -5.4375, -1.2344, 3.1719, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2812, -4.4688, 0.3613, 1.3828, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.8906, -1.4766, 2.0938, 1.4062, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5000, -4.8438, -0.2617, 3.0156, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -3.8750, 0.5234, 2.3594, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:39:12,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:39:12,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.57 | bwd_microstep: 1528.07 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 1527.09 | step_microstep: 1.66 [2025-11-06 18:39:12,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.63 | bwd: 1528.95 | bwd_inner: 1.68 | bwd_allreduce: 1527.14 | step: 1.75 63%|██████▎ | 2220/3507 [54:25<37:54, 1.77s/it] {'loss': 0.6413, 'learning_rate': 6.272593515247971e-06, 'epoch': 0.63} 63%|██████▎ | 2220/3507 [54:25<37:54, 1.77s/it]tensor([[-6.0625, -3.2812, 2.0781, 1.2578, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8594, -3.8906, -0.2461, 4.0938, -1.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -1.2812, 2.3281, -2.4062, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:2') tensor([[-6.5000, -2.9844, 3.1719, 0.8477, -5.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:39:12,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.47 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.0938, -0.9727, 2.7031, 0.2754, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2812, 1.2344, 3.5781, -2.0312, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.7812, -5.1875, -0.7812, 2.5781, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7812, -4.8438, -1.5469, 0.6289, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:39:12,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:39:12,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.65 | bwd_microstep: 2.77 | bwd_inner_microstep: 1.52 | bwd_allreduce_microstep: 1.14 | step_microstep: 1.85 [2025-11-06 18:39:12,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 489.15 | bwd: 3.65 | bwd_inner: 2.31 | bwd_allreduce: 1.17 | step: 1.94 63%|██████▎ | 2221/3507 [54:26<29:58, 1.40s/it] {'loss': 0.404, 'learning_rate': 6.264023523211283e-06, 'epoch': 0.63} 63%|██████▎ | 2221/3507 [54:26<29:58, 1.40s/it]tensor([[-6.5625, -4.3438, 0.9922, 1.4062, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9688, -3.6250, -2.2969, 1.4531, -0.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.4375, -3.3125, 0.8789, 3.1562, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 
18:39:12,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.83 | bwd_microstep: 1.40 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-4.1250, -2.8906, 1.9531, 4.3125, -1.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3750, -2.9219, 1.0078, 0.4375, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7500, -6.0938, -3.1875, 1.2656, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8750, 0.8320, 3.3750, -2.8125, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.7188, -5.1250, -0.5859, 1.0156, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:39:15,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:39:15,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.70 | bwd_microstep: 2711.31 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 2710.08 | step_microstep: 1.99 [2025-11-06 18:39:15,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.55 | bwd: 2712.71 | bwd_inner: 2.39 | bwd_allreduce: 2710.14 | step: 2.10 63%|██████▎ | 2222/3507 [54:29<41:00, 1.91s/it] {'loss': 0.477, 'learning_rate': 6.255456718971053e-06, 'epoch': 0.63} 63%|██████▎ | 2222/3507 [54:29<41:00, 1.91s/it]tensor([[-7.7500, -5.4062, 0.2676, 0.7852, -5.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:39:15,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.17 | bwd_microstep: 1.23 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.6250, -6.0625, -0.3848, 1.6875, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:1') tensor([[-4.6562, -3.5000, 0.4277, 2.0938, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1094, -0.1060, 2.2031, -0.7656, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.2188, -3.7344, 1.5234, 1.3359, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.0000, -3.3594, 2.2812, 1.4688, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3750, -2.1562, 1.9609, 1.7500, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8750, -4.7188, -1.1250, 2.6250, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:39:16,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.27 [2025-11-06 18:39:16,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.62 | bwd_microstep: 84.86 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 83.57 | step_microstep: 1.96 [2025-11-06 18:39:16,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 294.82 | bwd: 86.09 | bwd_inner: 2.35 | bwd_allreduce: 83.62 | step: 2.05 63%|██████▎ | 2223/3507 [54:29<31:19, 1.46s/it] {'loss': 0.7949, 'learning_rate': 6.246893109837076e-06, 'epoch': 0.63} 63%|██████▎ | 2223/3507 [54:29<31:19, 1.46s/it]tensor([[-5.5312, -2.7344, 1.8906, 0.2969, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4062, -4.7188, 0.1084, 1.2109, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1250, -3.7969, 1.6484, 1.8281, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5000, -4.0625, 1.5000, 1.7266, -4.5312]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:39:16,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.21 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.4375, -4.4375, -1.6250, 2.0469, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2812, -4.8125, -1.5781, 1.4062, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8438, -4.7500, -0.6328, 3.4062, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7188, -5.0938, -1.0781, 1.8359, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:39:17,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:39:17,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.03 | bwd_microstep: 754.60 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 753.44 | step_microstep: 1.81 [2025-11-06 18:39:17,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 404.27 | bwd: 755.65 | bwd_inner: 2.00 | bwd_allreduce: 753.49 | step: 1.91 63%|██████▎ | 2224/3507 [54:31<29:35, 1.38s/it] {'loss': 0.3958, 'learning_rate': 6.238332703116425e-06, 'epoch': 0.63} 63%|██████▎ | 2224/3507 [54:31<29:35, 1.38s/it]tensor([[-6.2812, -3.5000, 1.8516, 1.1094, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9688, -4.4062, 0.1436, 1.7734, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:39:17,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.43 | bwd_microstep: 5.72 | bwd_inner_microstep: 5.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.7188, -1.6484, 
3.4375, -0.6016, -5.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7812, -4.7500, 0.1484, 2.7500, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0625, -5.7812, -1.1172, 3.0156, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0625, -0.5312, 3.6562, -1.8594, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.8125, -5.4062, -2.1562, 2.7656, -1.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0000, -0.3672, 3.2812, -2.7031, -5.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:39:17,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.77 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:39:17,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.44 | bwd_microstep: 58.67 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 57.73 | step_microstep: 2.48 [2025-11-06 18:39:17,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.89 | bwd: 64.39 | bwd_inner: 6.46 | bwd_allreduce: 57.77 | step: 2.57 63%|██████▎ | 2225/3507 [54:31<23:41, 1.11s/it] {'loss': 0.6234, 'learning_rate': 6.2297755061134354e-06, 'epoch': 0.63} 63%|██████▎ | 2225/3507 [54:31<23:41, 1.11s/it]tensor([[-3.4219, 0.5391, 1.9609, -3.0156, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.4688, -3.9062, -0.0295, 2.9844, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:39:18,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.89 | bwd_microstep: 2.36 | bwd_inner_microstep: 2.23 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.22 tensor([[-4.0000, -2.6406, 0.9375, 2.2812, -2.3281]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7344, 0.7891, 3.6719, -0.1562, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.9727, 2.3750, 2.1875, -1.7344, -1.7578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.9375, -3.6719, 1.1328, 1.1172, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5625, -4.6250, -1.2266, 2.8125, -1.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6562, -2.2031, 1.6406, 0.6602, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:39:20,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:39:20,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.88 | bwd_microstep: 1787.85 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 1786.64 | step_microstep: 2.17 [2025-11-06 18:39:20,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 504.74 | bwd: 1790.21 | bwd_inner: 3.37 | bwd_allreduce: 1786.69 | step: 2.39 63%|██████▎ | 2226/3507 [54:34<31:49, 1.49s/it] {'loss': 0.4659, 'learning_rate': 6.221221526129715e-06, 'epoch': 0.63} 63%|██████▎ | 2226/3507 [54:34<31:49, 1.49s/it]tensor([[-4.0938, -4.1562, -1.3594, 2.2188, -1.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8594, -0.2520, 2.3125, 0.4219, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4062, -2.4844, 2.7500, -0.9023, -5.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3750, -3.7812, 0.7109, 1.8984, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:39:20,431] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.55 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-6.9375, -5.3438, 0.5781, 2.8281, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8125, -4.0312, -1.7266, 1.9531, -1.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.9375, -5.1250, -1.8125, 2.2656, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[3.9688, 5.3750, 7.0000, 6.8750, 3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:39:20,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:39:20,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 137.10 | bwd_microstep: 29.90 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 29.05 | step_microstep: 1.56 [2025-11-06 18:39:20,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.67 | bwd: 30.74 | bwd_inner: 1.51 | bwd_allreduce: 29.09 | step: 1.66 64%|██████▎ | 2227/3507 [54:34<24:46, 1.16s/it] {'loss': 0.4282, 'learning_rate': 6.212670770464102e-06, 'epoch': 0.64} 64%|██████▎ | 2227/3507 [54:34<24:46, 1.16s/it]tensor([[-3.3906, -3.8125, -1.4453, 2.6406, -0.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1875, -4.9375, -0.1172, 2.2031, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:39:20,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.68 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.7188, -4.7188, -1.4922, 2.2656, -2.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:3') tensor([[-4.7188, -1.6484, 1.8125, -0.8672, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.4375, 2.0781, 2.8750, -1.6953, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4062, -0.1904, 3.7344, -1.2422, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6875, -3.4531, 0.7500, 2.6562, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2188, -3.6875, 0.6719, 2.0312, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:39:23,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.22 | optimizer_step: 0.21 [2025-11-06 18:39:23,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.49 | bwd_microstep: 2684.33 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 2683.07 | step_microstep: 2.58 [2025-11-06 18:39:23,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.19 | bwd: 2685.01 | bwd_inner: 1.74 | bwd_allreduce: 2683.12 | step: 2.66 64%|██████▎ | 2228/3507 [54:37<36:51, 1.73s/it] {'loss': 0.3047, 'learning_rate': 6.20412324641271e-06, 'epoch': 0.64} 64%|██████▎ | 2228/3507 [54:37<36:51, 1.73s/it]tensor([[-6.0625, -3.8906, 1.3438, 1.7188, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2188, -4.5625, -1.9219, 2.3125, -1.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:39:23,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.39 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-6.6562, -6.2812, -1.8594, 1.6953, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') 
tensor([[-6.2812, -3.4062, 2.0312, 0.8281, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0000, -4.4688, -0.6875, 2.1562, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.3125, -6.1250, -1.5625, 2.7812, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.5000, -5.4062, -1.2031, -0.8516, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -1.1719, 2.1250, -0.9453, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 18:39:24,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:39:24,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.67 | bwd_microstep: 71.15 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 70.05 | step_microstep: 2.34 [2025-11-06 18:39:24,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.10 | bwd: 72.04 | bwd_inner: 1.77 | bwd_allreduce: 70.10 | step: 2.44 64%|██████▎ | 2229/3507 [54:37<28:53, 1.36s/it] {'loss': 0.7585, 'learning_rate': 6.195578961268881e-06, 'epoch': 0.64} 64%|██████▎ | 2229/3507 [54:37<28:53, 1.36s/it]tensor([[-6.8750, -4.7188, 0.9531, 1.8047, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -3.7344, 0.4473, 1.6797, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.7812, -3.9531, -0.6406, -2.3594, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:39:24,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.90 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.2500, -5.6875, -1.5547, 
1.7188, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -0.9805, 2.8281, -1.2344, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.2031, -2.9375, -1.8203, 1.9844, -0.1030]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-5.6562, -3.7500, 0.7852, 1.5859, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.0000, 2.2500, 3.4531, -2.2812, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:39:27,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.79 | optimizer_gradients: 0.21 | optimizer_step: 0.34 [2025-11-06 18:39:27,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 200.95 | bwd_microstep: 3047.97 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 3046.94 | step_microstep: 3.47 [2025-11-06 18:39:27,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.88 | bwd: 3048.69 | bwd_inner: 1.54 | bwd_allreduce: 3046.98 | step: 3.55 64%|██████▎ | 2230/3507 [54:41<43:52, 2.06s/it] {'loss': 0.4759, 'learning_rate': 6.187037922323198e-06, 'epoch': 0.64} 64%|██████▎ | 2230/3507 [54:41<43:52, 2.06s/it]tensor([[-2.2188, -0.6172, 1.5859, 1.5156, -1.3984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -1.2422, 1.8672, -0.5391, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:39:28,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.42 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.3438, -6.6562, -1.7109, 1.8281, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -5.4062, -2.7656, 1.8984, -2.0156]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5000, -0.4980, 2.7344, -1.6250, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4688, -3.0781, 2.0781, -0.1270, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.1172, 1.0156, 3.7344, 2.6562, -0.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.6875, -3.7031, 2.4219, 1.6328, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:39:28,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.32 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 18:39:28,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 199.33 | bwd_microstep: 66.51 | bwd_inner_microstep: 1.36 | bwd_allreduce_microstep: 65.07 | step_microstep: 3.54 [2025-11-06 18:39:28,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.78 | bwd: 67.44 | bwd_inner: 2.19 | bwd_allreduce: 65.11 | step: 3.63 64%|██████▎ | 2231/3507 [54:42<33:53, 1.59s/it] {'loss': 0.3728, 'learning_rate': 6.178500136863477e-06, 'epoch': 0.64} 64%|██████▎ | 2231/3507 [54:42<33:53, 1.59s/it]tensor([[-4.5625, -1.6094, 2.0000, 0.0591, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8438, -4.4688, -0.3477, 1.6094, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7188, -4.8125, -1.5312, 2.4844, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0312, -0.9453, 2.5938, 0.0708, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:39:28,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.46 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | 
bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.4219, -4.0000, -1.2891, 3.5156, -0.7891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -4.5938, -2.3906, 2.0000, -1.4141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0312, -4.1250, -0.7031, 1.5547, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1250, -5.3750, -0.6016, 2.4062, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:39:31,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.18 | optimizer_step: 0.16 [2025-11-06 18:39:31,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.64 | bwd_microstep: 2935.12 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 2933.92 | step_microstep: 2.21 [2025-11-06 18:39:31,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.12 | bwd: 2935.96 | bwd_inner: 1.87 | bwd_allreduce: 2933.97 | step: 2.30 64%|██████▎ | 2232/3507 [54:45<44:55, 2.11s/it] {'loss': 0.0843, 'learning_rate': 6.169965612174744e-06, 'epoch': 0.64} 64%|██████▎ | 2232/3507 [54:45<44:55, 2.11s/it]tensor([[-3.5156e+00, -3.2344e+00, 1.2741e-03, 3.2812e+00, -1.3203e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2500, -4.7188, -2.6406, 1.4844, -1.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-2.9219, 1.2578, 3.3750, -1.8047, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:39:31,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.08 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.3125, -1.5625, 1.5703, 3.8906, -0.6172]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7812, -4.8750, -0.7422, 1.7578, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7656, -3.9688, -1.8281, 1.6953, -1.4141]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6562, -3.7188, -0.0508, 2.3906, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.8438, -3.4844, 0.3320, 1.6797, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:39:32,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.22 | optimizer_step: 0.18 [2025-11-06 18:39:32,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.10 | bwd_microstep: 50.68 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 49.63 | step_microstep: 1.94 [2025-11-06 18:39:32,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.20 | bwd: 51.65 | bwd_inner: 1.83 | bwd_allreduce: 49.67 | step: 2.03 64%|██████▎ | 2233/3507 [54:45<34:22, 1.62s/it] {'loss': 1.2831, 'learning_rate': 6.161434355539258e-06, 'epoch': 0.64} 64%|██████▎ | 2233/3507 [54:45<34:22, 1.62s/it]tensor([[-4.6562, -4.7812, -1.0312, 3.6094, -1.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:39:32,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.66 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.4062, -4.3750, 0.0747, 0.4473, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4375, -0.7500, 1.8984, 0.0339, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8438, -4.7500, -0.9922, 3.1875, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:3') tensor([[-4.3125, -2.2344, 1.5391, 1.5156, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2188, -1.3984, 3.1094, -0.4629, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5938, -3.5469, 0.0820, 1.8750, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6250, -4.4062, 0.0200, 1.9766, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:39:34,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.36 | optimizer_gradients: 0.19 | optimizer_step: 0.22 [2025-11-06 18:39:34,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.01 | bwd_microstep: 2343.45 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 2342.03 | step_microstep: 3.78 [2025-11-06 18:39:34,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.70 | bwd: 2344.36 | bwd_inner: 2.11 | bwd_allreduce: 2342.09 | step: 3.87 64%|██████▎ | 2234/3507 [54:48<41:35, 1.96s/it] {'loss': 0.8554, 'learning_rate': 6.1529063742364844e-06, 'epoch': 0.64} 64%|██████▎ | 2234/3507 [54:48<41:35, 1.96s/it]tensor([[-2.1875, 1.6953, 3.1875, -1.8594, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4688, -4.5000, -1.6016, 1.6953, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4688, -4.5312, -0.5156, 1.9141, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:39:35,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.40 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.2031, 2.1094, 4.2500, -1.3047, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') 
tensor([[-3.6875, -1.9297, 1.0000, 1.2422, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7812, -4.4062, -0.3301, 1.3906, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5938, -1.5078, 1.3281, 0.3984, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4375, -3.4844, 0.6758, 1.1094, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:39:35,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 18:39:35,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 115.44 | bwd_microstep: 140.69 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 139.80 | step_microstep: 1.95 [2025-11-06 18:39:35,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 291.86 | bwd: 141.45 | bwd_inner: 1.45 | bwd_allreduce: 139.85 | step: 2.03 64%|██████▎ | 2235/3507 [54:49<32:13, 1.52s/it] {'loss': 0.5415, 'learning_rate': 6.144381675543092e-06, 'epoch': 0.64} 64%|██████▎ | 2235/3507 [54:49<32:13, 1.52s/it]tensor([[-5.5312, -4.0625, 0.1094, 1.5234, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0312, -5.6250, -1.0859, 2.9688, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([[-4.3438, -1.2734, 1.9688, -0.0469, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([2], device='cuda:0') [2025-11-06 18:39:35,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 209.96 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-1.1641, 2.1875, 3.7344, -0.5195, -1.8984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-2.8906, -3.1094, -0.1445, 
3.8281, -0.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8750, -4.6562, -0.8477, 3.2500, -2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4062, -3.5625, 0.2715, 2.6562, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -5.2500, -1.8438, 2.9219, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:39:37,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 5.61 | optimizer_gradients: 0.20 | optimizer_step: 0.28 [2025-11-06 18:39:37,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.86 | bwd_microstep: 1824.52 | bwd_inner_microstep: 6.25 | bwd_allreduce_microstep: 1818.17 | step_microstep: 7.77 [2025-11-06 18:39:37,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 411.85 | bwd: 1825.31 | bwd_inner: 6.93 | bwd_allreduce: 1818.23 | step: 7.87 64%|██████▍ | 2236/3507 [54:51<37:02, 1.75s/it] {'loss': 0.2921, 'learning_rate': 6.135860266732952e-06, 'epoch': 0.64} 64%|██████▍ | 2236/3507 [54:51<37:02, 1.75s/it]tensor([[-2.7500, 0.8477, 3.5469, -0.6875, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.5938, -2.1719, 1.5547, 0.6445, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:39:37,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.49 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.0000, -4.1562, 0.5156, 3.3906, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7969, -0.5312, 2.2344, -0.5625, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3750, -4.6562, -1.6250, 2.5938, -1.6641]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.1875, -6.0312, -1.2656, 1.2891, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.9531, -3.6250, 0.2773, 3.7500, -1.5859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3438, -4.2812, 0.3672, 2.9531, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:39:38,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:39:38,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.67 | bwd_microstep: 70.85 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 69.93 | step_microstep: 1.93 [2025-11-06 18:39:38,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.19 | bwd: 71.82 | bwd_inner: 1.69 | bwd_allreduce: 69.97 | step: 2.02 64%|██████▍ | 2237/3507 [54:51<28:35, 1.35s/it] {'loss': 0.443, 'learning_rate': 6.127342155077127e-06, 'epoch': 0.64} 64%|██████▍ | 2237/3507 [54:51<28:35, 1.35s/it]tensor([[-5.7500, -4.6250, -0.5430, 1.3594, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5000, -0.2188, 2.5312, -0.5352, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:39:38,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.93 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.3125, 1.6172, 3.1875, -1.7734, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7188, -4.0938, -1.8047, 2.1094, -1.2109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0000, -3.8281, -0.7109, 2.5312, -1.7266]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:0') tensor([[-1.6094, 1.7812, 3.4219, -0.1104, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1875, 0.1514, 2.3281, -1.0781, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.6250, -1.1797, 1.7734, -0.9688, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:39:41,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.17 | optimizer_step: 0.15 [2025-11-06 18:39:41,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.87 | bwd_microstep: 3213.20 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 3211.84 | step_microstep: 2.09 [2025-11-06 18:39:41,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.85 | bwd: 3213.93 | bwd_inner: 1.86 | bwd_allreduce: 3211.89 | step: 2.18 64%|██████▍ | 2238/3507 [54:55<42:41, 2.02s/it] {'loss': 0.9841, 'learning_rate': 6.118827347843862e-06, 'epoch': 0.64} 64%|██████▍ | 2238/3507 [54:55<42:41, 2.02s/it]tensor([[-4.7188, -0.6055, 3.7031, -0.9062, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0938, -4.0312, -0.0056, 2.2812, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:39:41,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.99 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.0938, -7.6562, -4.4375, 0.5859, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-8.3750, -6.5625, -0.5391, 1.1250, -5.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -4.5938, -1.8750, 2.1719, -1.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:3') tensor([[-4.2188, -3.6094, 0.2363, 3.1094, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5938, -0.9102, 3.1719, -0.0282, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.6250, -5.2188, -0.1836, 1.7344, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:39:42,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:39:42,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.07 | bwd_microstep: 117.82 | bwd_inner_microstep: 1.79 | bwd_allreduce_microstep: 115.95 | step_microstep: 1.76 [2025-11-06 18:39:42,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.10 | bwd: 118.65 | bwd_inner: 2.52 | bwd_allreduce: 115.98 | step: 1.84 64%|██████▍ | 2239/3507 [54:56<33:05, 1.57s/it] {'loss': 0.645, 'learning_rate': 6.110315852298586e-06, 'epoch': 0.64} 64%|██████▍ | 2239/3507 [54:56<33:05, 1.57s/it]tensor([[-3.5625, -0.5273, 0.8242, -1.9688, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.2500, -3.9844, -0.5742, 2.8906, -1.8047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:39:42,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.40 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.4062, -5.0938, -0.9141, 3.0000, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -1.6875, 1.4453, -1.2891, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0625, -3.6875, 1.2031, 2.9219, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1875, 
-4.0938, 0.2832, 0.6914, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2188, -4.1562, 0.7031, 3.3281, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6719, -0.5430, 3.0156, 0.8594, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:39:44,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:39:44,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 110.25 | bwd_microstep: 2054.28 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 2053.24 | step_microstep: 1.76 [2025-11-06 18:39:44,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 277.66 | bwd: 2055.10 | bwd_inner: 1.67 | bwd_allreduce: 2053.28 | step: 1.84 64%|██████▍ | 2240/3507 [54:58<38:19, 1.82s/it] {'loss': 1.1439, 'learning_rate': 6.101807675703906e-06, 'epoch': 0.64} 64%|██████▍ | 2240/3507 [54:58<38:19, 1.82s/it]tensor([[-4.0000, 0.0251, 3.2812, -1.3125, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7188, -4.5938, -1.0078, 2.9688, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1562, -1.9141, 1.1016, -1.4297, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:39:44,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.94 | bwd_microstep: 1.30 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-2.5000, 2.0938, 4.3750, -1.9844, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0625, -1.6641, 2.1562, -0.4980, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6250, -5.2188, -0.9531, 2.9219, 
-2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -3.8906, 0.6016, 2.8750, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.3438, -4.1875, 0.1318, 2.4531, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:39:45,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.27 [2025-11-06 18:39:45,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.89 | bwd_microstep: 49.78 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 48.58 | step_microstep: 2.05 [2025-11-06 18:39:45,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.86 | bwd: 51.08 | bwd_inner: 2.34 | bwd_allreduce: 48.62 | step: 2.12 64%|██████▍ | 2241/3507 [54:58<29:39, 1.41s/it] {'loss': 0.0771, 'learning_rate': 6.093302825319589e-06, 'epoch': 0.64} 64%|██████▍ | 2241/3507 [54:58<29:39, 1.41s/it]tensor([[-3.2344, 0.7656, 3.0938, -1.6016, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.7188, -2.8125, 3.3125, 0.3594, -5.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:39:45,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 213.91 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.8125, -4.7500, 0.7891, 1.5859, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6250, -1.1406, 3.3125, -1.7734, -5.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.9375, -4.3750, 0.6328, 2.2969, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5938, -1.3125, 3.3750, -1.2188, -5.3438]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0312, -1.2344, 2.8594, -0.9766, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7031, 0.4395, 2.6250, -0.1279, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:39:46,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.32 | optimizer_step: 0.27 [2025-11-06 18:39:46,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.88 | bwd_microstep: 872.49 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 871.45 | step_microstep: 2.63 [2025-11-06 18:39:46,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.81 | bwd: 873.43 | bwd_inner: 1.77 | bwd_allreduce: 871.51 | step: 2.72 64%|██████▍ | 2242/3507 [55:00<28:40, 1.36s/it] {'loss': 0.1194, 'learning_rate': 6.084801308402579e-06, 'epoch': 0.64} 64%|██████▍ | 2242/3507 [55:00<28:40, 1.36s/it]tensor([[-5.0312, -1.7188, 2.8750, 0.4277, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.0625, 1.2031, 2.4375, 0.3223, -1.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:39:46,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 67.51 | bwd_microstep: 1.29 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-4.8750, -1.7891, 2.2656, 0.3457, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5312, -4.4688, 1.1328, 1.9609, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6562, -5.3750, -1.5703, 1.9688, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7969, -2.3281, 0.7422, 1.2422, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') tensor([[-6.9375, -3.2812, 1.1094, -2.0156, -6.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8906, 0.4922, 2.9688, -2.7031, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:39:46,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:39:46,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.47 | bwd_microstep: 277.35 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 276.22 | step_microstep: 1.83 [2025-11-06 18:39:46,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 178.99 | bwd: 278.64 | bwd_inner: 2.19 | bwd_allreduce: 276.28 | step: 1.94 64%|██████▍ | 2243/3507 [55:00<23:13, 1.10s/it] {'loss': 0.2109, 'learning_rate': 6.07630313220695e-06, 'epoch': 0.64} 64%|██████▍ | 2243/3507 [55:00<23:13, 1.10s/it]tensor([[-5.8750, -3.4062, 1.2109, 0.6406, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:39:46,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.81 | bwd_microstep: 1.21 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.24 tensor([[-4.5000, -3.3906, 0.3438, 2.0312, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8438, -2.5312, 2.1406, -0.5430, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4688, -3.8750, -0.0732, 0.8086, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.7812, -4.4375, 0.3906, 2.3906, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8750, -3.0156, 0.5312, 0.6562, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-4.4688, -1.2266, 2.1719, 0.0952, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.6328, 1.5000, 2.2344, -1.5312, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 18:39:48,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:39:48,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.30 | bwd_microstep: 1627.38 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1626.30 | step_microstep: 2.23 [2025-11-06 18:39:48,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.16 | bwd: 1628.59 | bwd_inner: 2.04 | bwd_allreduce: 1626.36 | step: 2.47 64%|██████▍ | 2244/3507 [55:02<28:49, 1.37s/it] {'loss': 0.6468, 'learning_rate': 6.067808303983949e-06, 'epoch': 0.64} 64%|██████▍ | 2244/3507 [55:02<28:49, 1.37s/it]tensor([[-6.6250, -6.4062, -2.3906, 1.7031, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-6.3125, -3.6250, 0.9336, -0.2324, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([[-1.8594, 1.2422, 2.2969, -1.3984, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')tensor([3], device='cuda:1') tensor([2], device='cuda:0') [2025-11-06 18:39:48,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.58 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-6.8750, -5.4062, 0.4492, 2.7812, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.3438, -3.6250, -0.0233, 2.3281, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.3438, -6.4375, -1.5078, 1.6797, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -2.8750, 
0.7930, 2.3750, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5625, -3.6094, 0.7930, 1.3281, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:39:50,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.40 | optimizer_step: 0.30 [2025-11-06 18:39:50,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.31 | bwd_microstep: 1740.51 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1739.35 | step_microstep: 2.81 [2025-11-06 18:39:50,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.92 | bwd: 1741.36 | bwd_inner: 1.73 | bwd_allreduce: 1739.43 | step: 2.92 64%|██████▍ | 2245/3507 [55:04<33:26, 1.59s/it] {'loss': 0.1938, 'learning_rate': 6.059316830981954e-06, 'epoch': 0.64} 64%|██████▍ | 2245/3507 [55:04<33:26, 1.59s/it]tensor([[-4.2500, -4.0938, -1.4609, 1.7812, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5781, 0.0742, 1.9766, -2.2031, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4062, -4.1250, 0.5625, 0.4531, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:39:51,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.91 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-1.3594, 0.6797, 3.7812, 3.5781, -0.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6875, 0.2266, 2.8281, -1.9141, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7500, -2.2500, 2.9531, 0.4141, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.7500, -4.3438, 1.9531, 0.0791, -6.1250]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3438, -4.6250, -1.9766, 2.6094, -1.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:39:51,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:39:51,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 140.45 | bwd_microstep: 63.46 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 62.38 | step_microstep: 1.62 [2025-11-06 18:39:51,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.37 | bwd: 64.32 | bwd_inner: 1.72 | bwd_allreduce: 62.43 | step: 1.72 64%|██████▍ | 2246/3507 [55:05<26:12, 1.25s/it] {'loss': 0.2435, 'learning_rate': 6.050828720446487e-06, 'epoch': 0.64} 64%|██████▍ | 2246/3507 [55:05<26:12, 1.25s/it]tensor([[-4.0625, -4.1875, -0.9023, 3.2188, -1.4922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3594, -3.6875, -1.2266, 2.7656, -1.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:39:51,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.57 | bwd_microstep: 0.65 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.6875, -3.9688, 0.1602, 1.1094, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.6562, -5.1875, 0.8398, 1.1641, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.3516, 1.5625, 3.6250, 0.6406, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.6562, -5.4062, -0.1748, 2.0312, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.5312, -5.3750, -0.5898, -0.0820, -5.3750]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2812, -0.6406, 1.5938, -2.2031, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:39:52,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:39:52,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.89 | bwd_microstep: 1321.21 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 1320.33 | step_microstep: 2.06 [2025-11-06 18:39:52,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.50 | bwd: 1321.86 | bwd_inner: 1.33 | bwd_allreduce: 1320.39 | step: 2.14 64%|██████▍ | 2247/3507 [55:06<28:42, 1.37s/it] {'loss': 0.5509, 'learning_rate': 6.042343979620198e-06, 'epoch': 0.64} 64%|██████▍ | 2247/3507 [55:06<28:42, 1.37s/it]tensor([[-3.4844, 0.1855, 1.6562, -2.7812, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:39:53,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.57 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-0.7500, -1.5781, -0.7734, 3.0781, 1.0703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-6.3125, -6.8125, -4.0000, 0.6094, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6094, 0.6602, 2.6250, -1.2422, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1875, -1.3047, 3.0469, -0.3027, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4688, -4.5625, -0.2197, 2.2344, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9375, -4.6875, 0.2578, 2.5156, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') tensor([[-6.7500, -4.4375, 0.7930, 1.2500, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:39:53,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:39:53,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 113.42 | bwd_microstep: 1.47 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.63 | step_microstep: 2.18 [2025-11-06 18:39:53,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 265.00 | bwd: 2.28 | bwd_inner: 1.47 | bwd_allreduce: 0.67 | step: 2.26 64%|██████▍ | 2248/3507 [55:07<23:35, 1.12s/it] {'loss': 1.0045, 'learning_rate': 6.033862615742859e-06, 'epoch': 0.64} 64%|██████▍ | 2248/3507 [55:07<23:35, 1.12s/it]tensor([[-5.0000, -2.3125, 1.5078, -0.0332, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6250, -0.0703, 3.4062, -2.4219, -5.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2188, -4.0000, -1.3906, 1.7266, -1.9297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:39:54,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.30 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.6250, 0.4551, 1.8828, -3.1875, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.3438, -1.4922, 1.9062, 0.1133, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4062, -1.1406, 2.9844, -1.7188, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7969, -4.2500, -1.3984, 2.8438, -1.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.3750, -4.0938, 
[cleaned training-log excerpt, steps 2249-2270: each step also printed per-rank logit tensors (torch.bfloat16; the grad_fn class names were stripped by the log capture) and DeepSpeed [Rank 0] timing breakdowns; one representative of each is kept below and the per-step repeats are elided, along with duplicated tqdm refresh lines]

tensor([[-3.8281, -4.8438, -3.2969, 1.4453, -1.1484]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:39:56,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.33
[2025-11-06 18:39:56,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.92 | bwd_microstep: 1655.36 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 1654.17 | step_microstep: 2.42
[2025-11-06 18:39:56,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.18 | bwd: 1656.30 | bwd_inner: 1.89 | bwd_allreduce: 1654.23 | step: 2.52

64%|██████▍ | 2249/3507 [55:10<34:36, 1.65s/it] {'loss': 0.3905, 'learning_rate': 6.025384636051361e-06, 'epoch': 0.64}
64%|██████▍ | 2250/3507 [55:11<31:10, 1.49s/it] {'loss': 0.2055, 'learning_rate': 6.016910047779714e-06, 'epoch': 0.64}
64%|██████▍ | 2251/3507 [55:12<29:51, 1.43s/it] {'loss': 0.7599, 'learning_rate': 6.008438858159025e-06, 'epoch': 0.64}
64%|██████▍ | 2252/3507 [55:14<32:15, 1.54s/it] {'loss': 0.7247, 'learning_rate': 5.99997107441751e-06, 'epoch': 0.64}
64%|██████▍ | 2253/3507 [55:16<33:51, 1.62s/it] {'loss': 0.2329, 'learning_rate': 5.991506703780475e-06, 'epoch': 0.64}
64%|██████▍ | 2254/3507 [55:16<26:08, 1.25s/it] {'loss': 0.165, 'learning_rate': 5.983045753470308e-06, 'epoch': 0.64}
64%|██████▍ | 2255/3507 [55:19<35:27, 1.70s/it] {'loss': 0.8686, 'learning_rate': 5.974588230706484e-06, 'epoch': 0.64}
64%|██████▍ | 2256/3507 [55:19<28:37, 1.37s/it] {'loss': 0.8378, 'learning_rate': 5.966134142705557e-06, 'epoch': 0.64}
64%|██████▍ | 2257/3507 [55:22<35:54, 1.72s/it] {'loss': 0.5268, 'learning_rate': 5.957683496681143e-06, 'epoch': 0.64}
64%|██████▍ | 2258/3507 [55:23<28:19, 1.36s/it] {'loss': 0.5213, 'learning_rate': 5.949236299843925e-06, 'epoch': 0.64}
64%|██████▍ | 2259/3507 [55:26<38:49, 1.87s/it] {'loss': 0.4147, 'learning_rate': 5.940792559401648e-06, 'epoch': 0.64}
64%|██████▍ | 2260/3507 [55:26<30:22, 1.46s/it] {'loss': 0.398, 'learning_rate': 5.932352282559093e-06, 'epoch': 0.64}
64%|██████▍ | 2261/3507 [55:29<39:09, 1.89s/it] {'loss': 0.614, 'learning_rate': 5.923915476518097e-06, 'epoch': 0.64}
64%|██████▍ | 2262/3507 [55:29<30:11, 1.46s/it] {'loss': 0.3956, 'learning_rate': 5.915482148477537e-06, 'epoch': 0.64}
[h264 @ 0x931cf40] mmco: unref short failure
65%|██████▍ | 2263/3507 [55:31<28:59, 1.40s/it] {'loss': 0.3809, 'learning_rate': 5.907052305633315e-06, 'epoch': 0.65}
65%|██████▍ | 2264/3507 [55:31<23:16, 1.12s/it] {'loss': 0.4004, 'learning_rate': 5.898625955178362e-06, 'epoch': 0.65}
65%|██████▍ | 2265/3507 [55:35<38:53, 1.88s/it] {'loss': 0.1534, 'learning_rate': 5.890203104302634e-06, 'epoch': 0.65}
65%|██████▍ | 2266/3507 [55:35<30:29, 1.47s/it] {'loss': 0.7212, 'learning_rate': 5.881783760193093e-06, 'epoch': 0.65}
65%|██████▍ | 2267/3507 [55:43<1:06:13, 3.20s/it] {'loss': 0.398, 'learning_rate': 5.87336793003371e-06, 'epoch': 0.65}
65%|██████▍ | 2268/3507 [55:43<50:47, 2.46s/it] {'loss': 0.3433, 'learning_rate': 5.864955621005465e-06, 'epoch': 0.65}
65%|██████▍ | 2269/3507 [55:45<43:25, 2.10s/it] {'loss': 0.2855, 'learning_rate': 5.856546840286325e-06, 'epoch': 0.65}
65%|██████▍ | 2270/3507 [55:45<34:03, 1.65s/it] {'loss': 0.6157, 'learning_rate':
5.848141595051256e-06, 'epoch': 0.65} 65%|██████▍ | 2270/3507 [55:45<34:03, 1.65s/it]tensor([[-5.7500, -5.5000, -2.2500, 1.3750, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:40:32,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.41 | bwd_microstep: 6.56 | bwd_inner_microstep: 6.41 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-6.7188, -5.6562, -1.3203, 0.9961, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8750, -3.9531, -0.6914, 3.1250, -1.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1250, -4.0000, 0.0728, 0.2061, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5625, -4.7188, -0.3145, 2.1562, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7344, 1.8047, 3.7500, -2.3594, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.9062, -4.5625, 0.9766, 1.3125, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3438, -4.5938, -0.6367, 2.0312, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:40:35,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.05 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:40:35,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.94 | bwd_microstep: 1.99 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.86 | step_microstep: 3.43 [2025-11-06 18:40:35,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.36 | bwd: 8.55 | bwd_inner: 7.48 | bwd_allreduce: 0.91 | step: 3.55 65%|██████▍ | 2271/3507 [55:49<46:23, 2.25s/it] {'loss': 0.2244, 'learning_rate': 5.839739892472192e-06, 'epoch': 0.65} 
65%|██████▍ | 2271/3507 [55:49<46:23, 2.25s/it]tensor([[-5.1250, -2.3438, 1.9219, 0.4902, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2500, 0.1797, 3.9844, -0.8242, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3438, -4.5625, -0.2617, 2.5938, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:40:35,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.10 | bwd_microstep: 1.22 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-4.0625, -0.2275, 3.5156, -0.0840, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2188, -4.1562, -1.2969, 2.0469, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4062, -5.5625, -1.6250, 2.8906, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1562, -4.1250, -0.0771, 2.2188, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.3750, -2.4375, 1.1250, -0.7227, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:40:35,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.18 | optimizer_step: 0.28 [2025-11-06 18:40:35,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.52 | bwd_microstep: 72.44 | bwd_inner_microstep: 9.22 | bwd_allreduce_microstep: 63.06 | step_microstep: 9.56 [2025-11-06 18:40:35,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.67 | bwd: 73.64 | bwd_inner: 10.32 | bwd_allreduce: 63.12 | step: 9.66 65%|██████▍ | 2272/3507 [55:49<35:23, 1.72s/it] {'loss': 0.0903, 'learning_rate': 5.831341739718055e-06, 'epoch': 0.65} 65%|██████▍ | 2272/3507 [55:49<35:23, 
1.72s/it]tensor([[-3.8281, 0.0698, 1.1641, -3.3750, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.8281, -4.5625, -2.4062, 2.0938, -1.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:40:36,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 134.09 | bwd_microstep: 7.53 | bwd_inner_microstep: 7.37 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 tensor([[-5.1875, -2.9375, 0.8672, 0.7578, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1250, -1.8516, 2.0625, 1.9453, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0000, 0.0854, 2.7500, 0.5703, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.7734, 2.0156, 3.0625, -2.0312, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -1.3828, 2.8281, 0.4004, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.3125, -3.7812, 0.6719, 0.0674, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:40:38,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:40:38,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.71 | bwd_microstep: 2.27 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 0.97 | step_microstep: 2.40 [2025-11-06 18:40:38,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.83 | bwd: 9.79 | bwd_inner: 8.57 | bwd_allreduce: 1.03 | step: 2.51 65%|██████▍ | 2273/3507 [55:52<38:49, 1.89s/it] {'loss': 0.5654, 'learning_rate': 5.8229471439547436e-06, 'epoch': 0.65} 65%|██████▍ | 2273/3507 [55:52<38:49, 1.89s/it]tensor([[-4.5625, -4.4062, -1.4062, 
2.2812, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5625, -5.3750, -2.7344, 2.6094, -1.4609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:40:38,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.83 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06 tensor([[-4.4062, -0.8594, 3.1250, -0.2676, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -4.0938, 0.3418, 3.6562, -2.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0625, -1.8438, 1.8359, -0.5859, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8125, -1.8750, 1.6953, -0.1523, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.4688, -4.4375, -0.9492, 2.7812, -1.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7656, -1.4766, 1.5625, 0.4590, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:40:40,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 18:40:40,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.13 | bwd_microstep: 1726.70 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1725.65 | step_microstep: 1.95 [2025-11-06 18:40:40,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.99 | bwd: 1727.39 | bwd_inner: 1.60 | bwd_allreduce: 1725.68 | step: 2.02 65%|██████▍ | 2274/3507 [55:54<40:10, 1.96s/it] {'loss': 0.1009, 'learning_rate': 5.8145561123451086e-06, 'epoch': 0.65} 65%|██████▍ | 2274/3507 [55:54<40:10, 1.96s/it]tensor([[-6.6875, -4.1250, 0.8242, 0.4082, -4.9062]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3438, -1.5781, 1.3906, -0.6797, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.7500, -3.6250, 0.5938, 2.4219, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:40:40,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.68 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.5625, -4.3438, -0.2500, 1.7891, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7188, -1.7891, 1.7188, -0.3340, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.7031, 1.2734, 1.9219, -1.1484, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-1.9141, 1.0938, 3.2969, 0.5430, -1.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.0625, -5.1562, 1.1172, 0.4180, -6.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:40:40,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.10 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:40:40,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.85 | bwd_microstep: 75.88 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 74.98 | step_microstep: 3.10 [2025-11-06 18:40:40,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.55 | bwd: 76.70 | bwd_inner: 1.52 | bwd_allreduce: 75.02 | step: 3.18 65%|██████▍ | 2275/3507 [55:54<31:03, 1.51s/it] {'loss': 0.724, 'learning_rate': 5.806168652048967e-06, 'epoch': 0.65} 65%|██████▍ | 2275/3507 [55:54<31:03, 1.51s/it]tensor([[-4.5625, -0.4492, 3.1562, -1.1328, -4.4688]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:40:41,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 108.28 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.1875, -4.7812, -3.1250, 0.8008, -1.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1562, -4.4062, -1.5312, 2.7656, -1.5078]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1875, -1.9297, 1.7812, 3.4062, -1.5234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.2188, -3.9844, 1.1797, 1.4375, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8750, -3.6875, 0.2012, 1.9375, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1875, -4.5938, -0.2539, 3.2656, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3438, -4.6875, 0.2070, 3.8438, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:40:41,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:40:41,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.18 | bwd_microstep: 663.16 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 662.12 | step_microstep: 1.63 [2025-11-06 18:40:41,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.49 | bwd: 663.87 | bwd_inner: 1.57 | bwd_allreduce: 662.16 | step: 1.71 65%|██████▍ | 2276/3507 [55:55<27:58, 1.36s/it] {'loss': 0.1806, 'learning_rate': 5.797784770223085e-06, 'epoch': 0.65} 65%|██████▍ | 2276/3507 [55:55<27:58, 1.36s/it]tensor([[-5.1250, -1.2969, 2.9375, -0.4980, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') tensor([[-4.9062, -4.8438, -0.8477, 3.4062, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3125, -0.8906, 2.1875, -0.8672, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.0625, -1.6797, 1.6016, -1.0312, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:40:42,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.82 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-6.7812, -4.8438, 0.8242, 2.0156, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4062, -0.6523, 1.9766, -0.1553, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7812, 0.9492, 2.8281, -1.2422, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2188, -5.2188, -0.9883, 1.6328, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:40:42,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.34 | optimizer_step: 0.20 [2025-11-06 18:40:42,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.76 | bwd_microstep: 2.17 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 1.17 | step_microstep: 1.99 [2025-11-06 18:40:42,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.61 | bwd: 3.06 | bwd_inner: 1.70 | bwd_allreduce: 1.21 | step: 2.07 65%|██████▍ | 2277/3507 [55:56<22:06, 1.08s/it] {'loss': 0.5044, 'learning_rate': 5.789404474021178e-06, 'epoch': 0.65} 65%|██████▍ | 2277/3507 [55:56<22:06, 1.08s/it]tensor([[-3.7500, -4.1875, -1.7891, 2.5000, -1.1484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0938, 
-3.1562, 0.9844, 1.2812, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.3672, 2.0938, 1.0703, -1.1328, -0.7773]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.9531, -3.3594, 0.2715, 3.0156, -1.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9062, -3.0781, -1.2344, 1.8438, -0.8984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:40:44,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.46 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.9375, -4.1562, 0.1748, 2.9844, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7500, -2.0000, 0.8086, -0.6992, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5625, -4.0625, -0.6289, 2.3750, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:40:46,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 18:40:46,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.68 | bwd_microstep: 1896.20 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1895.03 | step_microstep: 2.35 [2025-11-06 18:40:46,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.16 | bwd: 1897.05 | bwd_inner: 1.84 | bwd_allreduce: 1895.07 | step: 2.41 65%|██████▍ | 2278/3507 [56:00<41:57, 2.05s/it] {'loss': 0.5813, 'learning_rate': 5.781027770593901e-06, 'epoch': 0.65} 65%|██████▍ | 2278/3507 [56:00<41:57, 2.05s/it]tensor([[-3.1875, 0.4766, 2.2969, -1.8750, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7812, -2.1250, 1.8516, -1.3438, 
-5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2500, -3.4531, 2.1719, 1.5078, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9844, -3.8750, -1.9297, 2.7812, -0.4258]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:40:46,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.65 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.7500, -1.7578, 1.2734, -2.8281, -5.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9844, -4.6250, -2.1719, 2.4062, -1.2891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1250, -5.9688, -1.9297, 1.8906, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.8125, -5.0312, -0.9062, -0.1875, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:40:46,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:40:46,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.13 | bwd_microstep: 3.15 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1.97 | step_microstep: 1.86 [2025-11-06 18:40:46,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.82 | bwd: 4.10 | bwd_inner: 1.96 | bwd_allreduce: 2.00 | step: 1.94 65%|██████▍ | 2279/3507 [56:00<31:36, 1.54s/it] {'loss': 0.2444, 'learning_rate': 5.772654667088842e-06, 'epoch': 0.65} 65%|██████▍ | 2279/3507 [56:00<31:36, 1.54s/it]tensor([[-4.7500, -2.5469, 2.0625, 2.2031, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8750, -2.1406, 1.4766, -0.3477, -4.0000]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.7500, -4.9688, 0.6406, 2.1250, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:40:47,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.73 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.3438, -5.2500, -0.5547, 2.0312, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8438, -1.3828, 1.3438, -1.9062, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-8.3750, -5.3750, -1.5781, -3.2656, -6.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.1094, 0.8242, 3.3281, -1.0781, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.9688, -3.1250, 2.0938, -0.9805, -5.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:40:49,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.18 | optimizer_step: 0.24 [2025-11-06 18:40:49,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.96 | bwd_microstep: 1624.64 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 1623.64 | step_microstep: 3.44 [2025-11-06 18:40:49,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 388.72 | bwd: 1625.58 | bwd_inner: 1.77 | bwd_allreduce: 1623.67 | step: 3.54 65%|██████▌ | 2280/3507 [56:02<34:42, 1.70s/it] {'loss': 0.4204, 'learning_rate': 5.764285170650521e-06, 'epoch': 0.65} 65%|██████▌ | 2280/3507 [56:02<34:42, 1.70s/it]tensor([[-3.7031, 0.8945, 4.0000, -1.8047, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.7500, -3.6719, 1.8828, 0.8008, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:0') [2025-11-06 18:40:49,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 148.16 | bwd_microstep: 1.23 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 tensor([[-5.0000, -2.8594, 1.5547, 1.6250, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6562, -0.8945, 2.5938, -1.1953, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6562, -1.4531, 2.5312, 0.1543, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6562, 0.7305, 3.3438, -2.1562, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5938, -2.9844, 2.4688, -0.0466, -5.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -1.4062, 2.5156, 1.7734, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:40:49,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.29 | optimizer_step: 0.34 [2025-11-06 18:40:49,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.71 | bwd_microstep: 155.47 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 154.34 | step_microstep: 2.91 [2025-11-06 18:40:49,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.87 | bwd: 156.71 | bwd_inner: 2.09 | bwd_allreduce: 154.43 | step: 3.04 65%|██████▌ | 2281/3507 [56:03<27:24, 1.34s/it] {'loss': 0.2169, 'learning_rate': 5.7559192884203756e-06, 'epoch': 0.65} 65%|██████▌ | 2281/3507 [56:03<27:24, 1.34s/it]tensor([[-5.4375, -2.1719, 0.9766, -1.8438, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.5938, -2.5156, 0.3223, 3.9062, -0.4434]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-4.7188, -3.4688, 0.2949, 2.0469, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:40:49,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.06 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.1562, 1.5781, 2.0000, -2.5312, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.6562, -3.1562, 0.9258, 2.3281, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7812, -2.7969, 0.7500, 1.0703, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1875, -2.4531, 2.5469, -0.6797, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -1.1641, 3.1875, -0.1436, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:40:51,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.84 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:40:51,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.80 | bwd_microstep: 1705.53 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 1704.70 | step_microstep: 2.79 [2025-11-06 18:40:51,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.90 | bwd: 1706.35 | bwd_inner: 1.49 | bwd_allreduce: 1704.74 | step: 2.87 65%|██████▌ | 2282/3507 [56:05<31:58, 1.57s/it] {'loss': 0.7114, 'learning_rate': 5.747557027536763e-06, 'epoch': 0.65} 65%|██████▌ | 2282/3507 [56:05<31:58, 1.57s/it]tensor([[-2.7969, -3.7188, -2.2344, 2.2812, -0.3418]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1562, -4.9375, -0.7812, 3.2344, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9062, -6.1562, 
-2.7344, 1.7500, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.1875, 1.3203, 2.9219, -1.2031, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:40:51,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.25 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.2656, 2.1250, 2.3594, -2.2969, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -4.6875, -0.4551, 3.1250, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -2.8750, 1.1094, 0.9922, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5312, -2.9844, 2.4531, 1.9531, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:40:52,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.11 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:40:52,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.24 | bwd_microstep: 17.63 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 16.28 | step_microstep: 2.76 [2025-11-06 18:40:52,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.52 | bwd: 18.50 | bwd_inner: 2.07 | bwd_allreduce: 16.31 | step: 2.83 65%|██████▌ | 2283/3507 [56:05<25:09, 1.23s/it] {'loss': 0.2784, 'learning_rate': 5.739198395134947e-06, 'epoch': 0.65} 65%|██████▌ | 2283/3507 [56:05<25:09, 1.23s/it]tensor([[-0.7812, 1.9297, 2.1250, -1.1562, -1.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:40:52,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 86.37 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.04 | step_microstep: 
0.09
tensor([[-6.3125, -2.9062, 2.0625, -0.2314, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.5781, -1.8438, 0.4258, 0.0287, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.6250, -5.4375, -0.2539, 2.3281, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0000, -3.5312, 0.5312, 2.0469, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6094, 1.2656, 3.4219, -1.2500, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.2188, -1.8750, 1.7266, 0.9492, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2500, -3.5625, 0.1113, 2.7188, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:40:54,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.46 | optimizer_step: 0.41
[2025-11-06 18:40:54,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 76.71 | bwd_microstep: 2219.01 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 2217.87 | step_microstep: 10.51
[2025-11-06 18:40:54,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 163.09 | bwd: 2220.06 | bwd_inner: 2.02 | bwd_allreduce: 2217.92 | step: 10.62
65%|██████▌ | 2284/3507 [56:08<32:24, 1.59s/it] {'loss': 0.6747, 'learning_rate': 5.730843398347101e-06, 'epoch': 0.65}
tensor([[-5.5000, -1.4922, 2.6875, -1.1016, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8594, -2.4688, 1.6953, 2.6875, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:40:54,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.14 | bwd_microstep: 1.67 | bwd_inner_microstep: 1.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.7188, -1.9844, 1.2031, -0.0297, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.1562, -5.6875, -0.7188, 3.2031, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7188, -3.5469, 1.2500, 1.7422, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.7812e+00, -4.9062e+00, -7.6562e-01, -6.1035e-05, -4.7188e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5625, -4.7188, -1.3906, 2.6562, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.5000, -4.8125, 0.6836, 2.2031, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:40:55,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.09 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:40:55,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.26 | bwd_microstep: 240.90 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 239.86 | step_microstep: 6.36
[2025-11-06 18:40:55,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.43 | bwd: 242.57 | bwd_inner: 2.53 | bwd_allreduce: 239.90 | step: 6.45
65%|██████▌ | 2285/3507 [56:08<26:25, 1.30s/it] {'loss': 0.4397, 'learning_rate': 5.722492044302286e-06, 'epoch': 0.65}
tensor([[-5.5312, -3.4688, 1.5000, 1.8125, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.0312, -4.8125, -1.2656, 2.2500, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:40:55,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.58 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.7812, -3.5938, 1.6719, 1.8906, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8281, -0.2363, 2.6250, -1.1094, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.1250, -1.6797, 0.3535, -3.4531, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.5625, -3.8750, -1.0703, 1.2109, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.1875, -5.1875, -2.1250, 1.4844, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4688, -3.1562, 1.3281, 1.5547, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:40:58,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.23 | optimizer_step: 0.34
[2025-11-06 18:40:58,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.77 | bwd_microstep: 2835.72 | bwd_inner_microstep: 6.47 | bwd_allreduce_microstep: 2829.15 | step_microstep: 3.38
[2025-11-06 18:40:58,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.37 | bwd: 2836.82 | bwd_inner: 7.46 | bwd_allreduce: 2829.21 | step: 3.47
65%|██████▌ | 2286/3507 [56:12<38:25, 1.89s/it] {'loss': 0.3126, 'learning_rate': 5.714144340126471e-06, 'epoch': 0.65}
tensor([[-5.6250, -1.1953, 2.7656, -2.1562, -5.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.8906, -3.6875, 0.2188, 3.8125, -1.5078]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0938, -0.6602, 3.6094, -1.5391, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:40:58,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.24 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.9688, -4.5625, 0.0723, 1.8359, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.1016, 2.3281, 2.1719, -2.2969, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-2.7188, 0.1069, 3.8281, 2.1719, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2656, -3.9219, -1.4141, 3.3750, -0.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6875, -3.3594, 1.2656, 1.1641, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:40:58,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:40:58,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 121.08 | bwd_microstep: 70.05 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 69.07 | step_microstep: 14.71
[2025-11-06 18:40:58,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.34 | bwd: 70.77 | bwd_inner: 1.51 | bwd_allreduce: 69.11 | step: 14.80
65%|██████▌ | 2287/3507 [56:12<29:33, 1.45s/it] {'loss': 0.4421, 'learning_rate': 5.705800292942498e-06, 'epoch': 0.65}
tensor([[-5.1562, -4.3438, -0.8906, 1.2266, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:40:59,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.69 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.6562, -2.8750, 0.7656, 2.9688, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.0469, -0.1748, 3.3750, 1.5938, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3438, -1.6641, 1.9609, 0.5586, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.1875, -3.5469, 1.0000, 2.2188, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0000, -2.9375, 1.2500, 1.0391, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.0000, -5.4375, -1.4375, 1.8125, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9375, -3.7031, 0.4297, 2.2812, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:41:02,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 18:41:02,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.89 | bwd_microstep: 3074.53 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 3073.57 | step_microstep: 2.79
[2025-11-06 18:41:02,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.61 | bwd: 3075.22 | bwd_inner: 1.45 | bwd_allreduce: 3073.62 | step: 2.88
65%|██████▌ | 2288/3507 [56:16<41:43, 2.05s/it] {'loss': 0.7063, 'learning_rate': 5.697459909870084e-06, 'epoch': 0.65}
tensor([[-5.3438, -4.5000, 0.0952, 2.8594, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:02,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.91 | bwd_microstep: 2.16 | bwd_inner_microstep: 1.77 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.24
tensor([[-3.6406, -2.5625, 0.1504, 1.2422, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6250, -3.2031, 0.6016, 1.8828, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0625, -3.5625, 1.1328, 2.8594, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6875, -5.9375, -2.6406, 1.9219, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.7500, -3.9062, -0.3047, 2.1406, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.7500, 1.0625, 3.0469, -1.7812, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.2188, -5.2500, -2.5000, 0.8008, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:41:02,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.63 | optimizer_gradients: 0.32 | optimizer_step: 2.47
[2025-11-06 18:41:02,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.10 | bwd_microstep: 113.03 | bwd_inner_microstep: 2.32 | bwd_allreduce_microstep: 110.58 | step_microstep: 11.22
[2025-11-06 18:41:02,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 274.08 | bwd: 115.18 | bwd_inner: 4.12 | bwd_allreduce: 110.72 | step: 11.46
65%|██████▌ | 2289/3507 [56:16<31:53, 1.57s/it] {'loss': 1.0006, 'learning_rate': 5.689123198025836e-06, 'epoch': 0.65}
tensor([[-3.3906, 0.6328, 2.8594, -1.9062, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7969, -4.0625, -1.0859, 3.1250, -1.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:02,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.90 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-5.5938, -3.2969, 2.1719, 2.2500, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0312, -3.0469, 0.3828, 2.6562, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.1562, -1.1719, 1.5000, 1.1484, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.7812, -2.4219, 2.8438, 0.8633, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.3906, 1.6953, 3.2656, -2.1562, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.2188, -4.7500, -1.2344, 1.8828, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:41:05,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.27 | optimizer_step: 0.23
[2025-11-06 18:41:05,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.63 | bwd_microstep: 2295.16 | bwd_inner_microstep: 1.72 | bwd_allreduce_microstep: 2293.33 | step_microstep: 3.02
[2025-11-06 18:41:05,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 283.54 | bwd: 2296.14 | bwd_inner: 2.59 | bwd_allreduce: 2293.39 | step: 3.13
65%|██████▌ | 2290/3507 [56:19<38:20, 1.89s/it] {'loss': 0.2464, 'learning_rate': 5.6807901645232175e-06, 'epoch': 0.65}
tensor([[-3.5781, 0.6484, 2.7344, -2.0781, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.2188, 0.0113, 3.8125, 3.1406, -1.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:41:05,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.32 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.5625, -4.3438, -0.5547, 1.3203, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.1406, -1.3359, 1.1172, 0.6797, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.8125, -3.5156, -0.0977, 3.1094, -1.5859]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6094, -4.1562, -1.6172, 2.7812, -1.1172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.7500, -2.0469, 1.1641, 1.3203, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2188, -5.0625, -1.8125, 1.5859, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:41:05,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.24 | optimizer_step: 0.21
[2025-11-06 18:41:05,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.96 | bwd_microstep: 43.51 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 42.41 | step_microstep: 15.50
[2025-11-06 18:41:05,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.31 | bwd: 44.35 | bwd_inner: 1.73 | bwd_allreduce: 42.46 | step: 15.58
65%|██████▌ | 2291/3507 [56:19<29:40, 1.46s/it] {'loss': 0.2597, 'learning_rate': 5.672460816472556e-06, 'epoch': 0.65}
tensor([[-6.3125, -3.6719, 1.3672, 0.8242, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:06,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.93 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.9062, -4.3438, 0.4434, 1.8438, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.5000, -4.8438, 1.2734, 1.2031, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6250, -1.8594, 3.0625, -0.2793, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.2188, -2.1094, 0.3613, -2.4531, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0000, -3.4531, 0.7773, 2.4688, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3125, -4.7188, -2.2344, 1.9141, -1.6797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.8906, -0.1465, 3.2969, 1.7344, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:41:09,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.26 | optimizer_step: 0.37
[2025-11-06 18:41:09,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 137.09 | bwd_microstep: 3341.57 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 3340.48 | step_microstep: 3.04
[2025-11-06 18:41:09,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 298.03 | bwd: 3342.30 | bwd_inner: 1.61 | bwd_allreduce: 3340.53 | step: 3.12
65%|██████▌ | 2292/3507 [56:23<43:08, 2.13s/it] {'loss': 0.3239, 'learning_rate': 5.664135160981032e-06, 'epoch': 0.65}
tensor([[-2.6875, -3.4688, -1.3906, 3.2656, -0.2197]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.1250, 0.8906, 2.5156, -2.5469, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-7.3125, -5.1562, 0.6758, 1.3828, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5938, -0.6484, 3.0781, -1.0938, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:41:09,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.98 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-4.9375, -3.7500, 0.7031, 2.6719, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5625, -1.0312, 2.7812, 1.6719, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0625, -1.2891, 2.6562, -0.7266, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.0000, -4.9375, 0.8555, 1.9375, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:41:10,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:41:10,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.79 | bwd_microstep: 2.73 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 1.83 | step_microstep: 2.80
[2025-11-06 18:41:10,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 439.80 | bwd: 3.65 | bwd_inner: 1.59 | bwd_allreduce: 1.89 | step: 2.91
65%|██████▌ | 2293/3507 [56:23<33:15, 1.64s/it] {'loss': 0.6595, 'learning_rate': 5.655813205152678e-06, 'epoch': 0.65}
tensor([[-4.0938, -0.2871, 1.1641, -3.3438, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4531, -2.2344, 0.3438, 1.7031, -1.8828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:10,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.65 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-4.8125, -2.8438, 1.1797, 1.5625, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7188, -4.6250, -1.5312, 1.7656, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.7500, -0.3457, 1.6875, -1.8359, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0312, -4.2812, -0.7227, 1.9922, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3438, -4.0312, -0.7148, 2.5781, -1.9453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6875, -0.8867, 2.1250, 0.4766, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:41:13,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:41:13,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.06 | bwd_microstep: 3022.33 | bwd_inner_microstep: 8.06 | bwd_allreduce_microstep: 3014.17 | step_microstep: 2.16
[2025-11-06 18:41:13,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.72 | bwd: 3023.33 | bwd_inner: 8.95 | bwd_allreduce: 3014.23 | step: 2.26
65%|██████▌ | 2294/3507 [56:27<43:43, 2.16s/it] {'loss': 0.4048, 'learning_rate': 5.64749495608837e-06, 'epoch': 0.65}
tensor([[-3.6406, -3.1250, -1.0156, 1.2422, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.0625, -2.1875, -0.3262, 2.6875, -0.2490]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0938, -3.2188, 0.2988, 2.6406, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3438, -3.0625, 0.5938, 1.8750, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:13,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.62 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.1250, -2.0781, 1.7891, 4.0312, -1.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6875, -3.0000, -0.9023, 2.6875, -0.5508]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0312, -4.3125, -0.1611, 2.6562, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.7812, -3.4219, 0.3711, 2.1250, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:41:13,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:41:13,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.42 | bwd_microstep: 53.86 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 52.70 | step_microstep: 1.57
[2025-11-06 18:41:13,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.07 | bwd: 54.94 | bwd_inner: 2.07 | bwd_allreduce: 52.74 | step: 1.66
65%|██████▌ | 2295/3507 [56:27<33:10, 1.64s/it] {'loss': 0.6938, 'learning_rate': 5.639180420885817e-06, 'epoch': 0.65}
tensor([[-1.7500, -1.0781, 1.1484, 3.1562, -0.3203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.2188, -1.1641, 1.3438, 2.5625, -0.9727]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:14,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.18 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.0469, 1.7031, 3.1562, -1.2031, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2500, -0.7812, 3.1094, 0.2432, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4688, -2.2969, 1.5703, 1.5625, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.9062, -3.0938, 0.8203, 1.6484, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0000, -0.3184, 3.5156, -2.3906, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0625, -1.6484, 2.7656, 0.5117, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:41:15,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.22 | optimizer_step: 0.33
[2025-11-06 18:41:15,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.91 | bwd_microstep: 1160.58 | bwd_inner_microstep: 1.28 | bwd_allreduce_microstep: 1159.20 | step_microstep: 2.49
[2025-11-06 18:41:15,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.12 | bwd: 1161.41 | bwd_inner: 2.03 | bwd_allreduce: 1159.25 | step: 2.56
65%|██████▌ | 2296/3507 [56:29<32:37, 1.62s/it] {'loss': 0.6033, 'learning_rate': 5.630869606639566e-06, 'epoch': 0.65}
tensor([[-4.0312, -2.1719, 1.0547, 1.5391, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:15,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.14 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.6875, -2.3594, 1.3047, 4.7188, -0.5742]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4062, -3.1875, 1.5938, 1.8516, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0000, -0.6953, 3.9219, -0.8281, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.8750, -3.7188, 0.9336, 1.1016, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.7812, -5.1562, -0.9688, 2.2188, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.2500, -4.8438, 0.1787, 2.2812, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.8438, -4.9062, 0.9023, 2.4375, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:41:16,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:41:16,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.88 | bwd_microstep: 810.45 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 809.43 | step_microstep: 1.55
[2025-11-06 18:41:16,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 419.03 | bwd: 811.38 | bwd_inner: 1.76 | bwd_allreduce: 809.47 | step: 1.64
65%|██████▌ | 2297/3507 [56:30<30:29, 1.51s/it] {'loss': 0.2656, 'learning_rate': 5.622562520440977e-06, 'epoch': 0.65}
tensor([[-4.7188, -2.5938, 1.2656, 1.0859, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.2188, -3.7656, 1.6875, -0.3477, -5.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.5625, -3.3594, 1.0234, 0.8750, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2812, -4.1875, -0.4473, 3.7031, -1.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-8.6875, -9.0000, -4.9062, 0.3105, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:16,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.32 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.2188, -3.9531, -0.7539, 2.6094, -1.8203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.5938, -3.7969, 0.6758, 1.5859, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6250, -4.1250, 0.2227, 1.6094, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:41:17,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.28 | optimizer_step: 0.25
[2025-11-06 18:41:17,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.81 | bwd_microstep: 29.67 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 28.51 | step_microstep: 2.63
[2025-11-06 18:41:17,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.16 | bwd: 30.63 | bwd_inner: 1.88 | bwd_allreduce: 28.56 | step: 2.73
66%|██████▌ | 2298/3507 [56:30<23:56, 1.19s/it] {'loss': 0.2738, 'learning_rate': 5.614259169378251e-06, 'epoch': 0.66}
tensor([[-4.1562, -4.3750, -0.7852, 4.0625, -1.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.1719, 1.5859, 2.3906, -2.4062, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:41:17,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.13 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-3.5312, -4.4062, -2.6094, 1.8672, -0.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4688, -2.4688, 1.6797, -0.2891, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9844, -0.6680, 2.4219, -0.3438, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.1250, -5.2188, 0.5273, 1.8984, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.0156, 1.3438, 2.6406, -1.6953, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4062, -1.8047, 2.9688, 0.0776, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:41:19,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.99 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 18:41:19,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.20 | bwd_microstep: 838.08 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 836.86 | step_microstep: 4.50
[2025-11-06 18:41:19,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 437.37 | bwd: 839.12 | bwd_inner: 2.05 | bwd_allreduce: 836.92 | step: 4.61
66%|██████▌ | 2299/3507 [56:32<29:14, 1.45s/it] {'loss': 0.1603, 'learning_rate': 5.605959560536376e-06, 'epoch': 0.66}
tensor([[-5.9688, -3.6406, 1.3594, 1.4609, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7812, 0.3008, 3.3125, -1.2188, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3125, -3.6094, -0.1631, 2.4062, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:19,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.73 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.5312, -3.6406, 0.9414, 1.7422, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1250, -1.7344, 1.6562, -1.0938, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6406, -3.4531, 0.1680, 3.5625, -1.3516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.2188, -2.9375, 0.7695, -1.5469, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[h264 @ 0x18bf5300] mmco: unref short failure
[h264 @ 0x18bf5300] mmco: unref short failure
tensor([[-1.7812, -2.4375, -0.6328, 3.4062, 0.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:41:21,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.36 | optimizer_step: 0.34
[2025-11-06 18:41:21,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.23 | bwd_microstep: 1540.85 | bwd_inner_microstep: 3.00 | bwd_allreduce_microstep: 1537.63 | step_microstep: 3.39
[2025-11-06 18:41:21,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.99 | bwd: 1541.73 | bwd_inner: 3.80 | bwd_allreduce: 1537.70 | step: 3.48
66%|██████▌ | 2300/3507 [56:34<32:18, 1.61s/it] {'loss': 0.4897, 'learning_rate': 5.5976637009971634e-06, 'epoch': 0.66}
tensor([[-5.6562, -5.0938, -0.5312, 2.8906, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:21,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.32 | bwd_microstep: 2.06 | bwd_inner_microstep: 1.80 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.17
tensor([[-1.5547, 1.2891, 2.2656, -0.6836, -1.8516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6250, -2.9062, 0.9219, 1.4609, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4062, 0.6758, 3.7031, -1.0469, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.5938, -4.0625, 0.3086, -0.4453, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2188, -3.7031, 0.6523, 2.3281, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7500, 0.3184, 2.9062, -1.9688, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.5312, -4.7188, 0.0127, 2.9688, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:41:22,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:41:22,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.16 | bwd_microstep: 1192.37 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1191.21 | step_microstep: 2.27
[2025-11-06 18:41:22,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 291.53 | bwd: 1194.42 | bwd_inner: 2.86 | bwd_allreduce: 1191.30 | step: 2.44
66%|██████▌ | 2301/3507 [56:36<31:46, 1.58s/it] {'loss': 0.3017, 'learning_rate': 5.589371597839215e-06, 'epoch': 0.66}
tensor([[-1.6406, 1.9531, 2.9531, -1.8359, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4375, -4.1562, 0.1416, 2.1250, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.8125, -5.3125, 1.0859, 1.6094, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:41:22,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.75 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.1562, -0.8008, 2.9531, -2.0781, -5.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.8125, -3.4844, 2.8594, 1.4141, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.6484, -2.0938, 0.1875, 4.3438, 0.4492]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.3750, -1.1797, 2.8125, -1.7578, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6250, 0.5625, 1.9609, -1.4922, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:41:24,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.26 | optimizer_step: 0.25
[2025-11-06 18:41:24,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.90 | bwd_microstep: 1005.30 | bwd_inner_microstep: 1.49 | bwd_allreduce_microstep: 1003.63 | step_microstep: 3.45
[2025-11-06 18:41:24,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.67 | bwd: 1006.30 | bwd_inner: 2.42 | bwd_allreduce: 1003.67 | step: 3.53
66%|██████▌ | 2302/3507 [56:37<30:35, 1.52s/it] {'loss': 0.253, 'learning_rate': 5.581083258137943e-06, 'epoch': 0.66}
tensor([[-4.1250, -1.5078, 2.0469, 0.7266, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:41:24,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 120.55 | bwd_microstep: 2.09 | bwd_inner_microstep: 1.82 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.21
tensor([[-4.5000, -1.3516, 2.2656, -0.0811, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9688, -5.2188, -1.6641, 2.8906, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3438, -1.6797, 2.9062, -0.5312, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.6875, -5.5625, -1.9219, 2.0000, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9375, -5.0000, -1.1797, 3.2188, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.1250, -4.7812, 0.8203, 3.2344, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6562, -4.8125, -1.4297, 2.6562, -1.9453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:41:24,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.29 | optimizer_gradients: 0.20 | optimizer_step: 0.21
[2025-11-06 18:41:24,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 229.05 | bwd_microstep: 79.75 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 78.51 | step_microstep: 3.93
[2025-11-06 18:41:24,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.65 | bwd: 81.84 | bwd_inner: 2.97 | bwd_allreduce: 78.61 | step: 4.14
66%|██████▌ | 2303/3507 [56:38<24:16, 1.21s/it] {'loss': 0.3692, 'learning_rate': 5.572798688965539e-06, 'epoch': 0.66}
tensor([[-5.3438, -3.9531, 0.3984, 1.8906, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3438, -3.8281, 0.7773, 2.1406, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0000, -2.0156, 1.9453, 0.1895, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:41:24,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.34 | bwd_microstep: 1.18 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-5.6562, -3.0938, 0.9805, 0.0361, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.7266, 2.5156, 2.3750, -3.0938, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-6.8750, -4.1562, 1.0391, 0.3496, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0938, 0.0211, 3.8594, -0.3574, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.9688, -1.8125, 3.1875, -0.6836, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:41:26,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.97 | optimizer_gradients: 0.22 | optimizer_step: 0.24
[2025-11-06 18:41:26,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.88 | bwd_microstep: 1938.34 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1937.13 | step_microstep: 5.10
[2025-11-06 18:41:26,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 472.26 | bwd: 1939.52 | bwd_inner: 2.15 | bwd_allreduce: 1937.19 | step: 5.21
66%|██████▌ | 2304/3507 [56:40<31:49, 1.59s/it] {'loss': 0.548, 'learning_rate': 5.564517897390962e-06, 'epoch': 0.66}
tensor([[-2.8906, 0.5312, 3.0000, -0.4609, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.2500, -5.2188, -0.9609, 1.4375, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:27,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.65 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-3.0938, 0.4316, 3.3125, -0.4160, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[ 0.6094, 3.6562, 4.0938, 0.6680, -0.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.0625, 2.3906, 3.6406, -0.6680, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1],
device='cuda:3') tensor([[-7.1875, -4.3438, 1.5000, 0.8242, -5.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7031, 0.7383, 3.0000, -0.5352, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.3594, 0.0933, 1.7500, 0.2871, -1.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:41:27,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:41:27,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.72 | bwd_microstep: 3.96 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 2.78 | step_microstep: 3.09 [2025-11-06 18:41:27,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.40 | bwd: 4.91 | bwd_inner: 1.90 | bwd_allreduce: 2.83 | step: 3.21 66%|██████▌ | 2305/3507 [56:41<24:44, 1.23s/it] {'loss': 0.4101, 'learning_rate': 5.556240890479978e-06, 'epoch': 0.66} 66%|██████▌ | 2305/3507 [56:41<24:44, 1.23s/it]tensor([[-5.3125, -2.9531, 0.9453, 0.1543, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:41:27,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.20 | bwd_microstep: 1.31 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.17 tensor([[-4.5938, -1.3125, 1.5078, -1.1875, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0312, -3.9219, -1.6953, 1.3828, -1.8359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -0.0649, 1.7500, -3.3750, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.6875, -5.4688, -3.7500, 0.5469, -1.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-4.1875, -2.1875, 
1.6953, 1.7500, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6094, -3.9219, -1.5312, 2.5312, -1.1484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5938, -4.0938, 0.5586, 2.2969, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:41:27,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.18 [2025-11-06 18:41:27,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.94 | bwd_microstep: 180.72 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 179.50 | step_microstep: 2.05 [2025-11-06 18:41:27,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.16 | bwd: 182.05 | bwd_inner: 2.30 | bwd_allreduce: 179.57 | step: 2.23 66%|██████▌ | 2306/3507 [56:41<20:40, 1.03s/it] {'loss': 1.0742, 'learning_rate': 5.547967675295102e-06, 'epoch': 0.66} 66%|██████▌ | 2306/3507 [56:41<20:40, 1.03s/it]tensor([[-1.8281, 2.1562, 2.3750, -2.8281, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.3906, 0.6172, 2.0469, -3.1250, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:41:28,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.77 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.5156, -4.1562, -1.5547, 3.1250, -0.8398]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.4375, -5.0000, 0.7148, 0.9375, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8906, -0.5547, 2.1250, -0.9805, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6250, -3.1562, 1.0625, 0.3438, -4.2188]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5312, -3.2969, 1.2109, 3.2812, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7500, -3.0938, 0.4883, -0.7031, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:41:30,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.62 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 18:41:30,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.87 | bwd_microstep: 1671.02 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 1669.84 | step_microstep: 4.35 [2025-11-06 18:41:30,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.64 | bwd: 1672.02 | bwd_inner: 2.00 | bwd_allreduce: 1669.89 | step: 4.45 66%|██████▌ | 2307/3507 [56:43<26:55, 1.35s/it] {'loss': 0.5327, 'learning_rate': 5.53969825889562e-06, 'epoch': 0.66} 66%|██████▌ | 2307/3507 [56:43<26:55, 1.35s/it]tensor([[-6.7812, -4.7500, 1.1484, 2.2031, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -2.9844, 0.5742, 0.2949, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:41:30,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.67 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.4375, 0.6602, 2.3594, -2.8750, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9375, -3.1719, 2.3125, 1.3203, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3438, -2.0781, 1.9453, 1.4453, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.4375, -1.6172, 1.3359, -0.4180, -3.6719]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8125, -2.8125, 0.6875, 0.8438, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -3.0312, 1.3594, 1.9375, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:41:31,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.07 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:41:31,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.99 | bwd_microstep: 757.01 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 755.96 | step_microstep: 3.54 [2025-11-06 18:41:31,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.66 | bwd: 757.89 | bwd_inner: 1.75 | bwd_allreduce: 756.01 | step: 3.63 66%|██████▌ | 2308/3507 [56:44<25:31, 1.28s/it] {'loss': 0.3978, 'learning_rate': 5.531432648337578e-06, 'epoch': 0.66} 66%|██████▌ | 2308/3507 [56:44<25:31, 1.28s/it]tensor([[-4.2188, -1.8125, 0.6992, -0.9141, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.5625, -3.9062, 1.8516, -0.6445, -6.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:41:31,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.07 | bwd_microstep: 1.29 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 tensor([[-2.7188, 1.3516, 3.5625, -1.6562, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8594, 0.5234, 2.0000, -1.6016, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.3750, -5.7188, -2.0625, 0.5547, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4688, -1.1719, 2.1406, -0.6133, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:1') tensor([[-3.3281, 0.0698, 1.9609, -1.5781, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2188, -3.6406, 1.0391, 2.5000, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:41:33,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.22 | optimizer_step: 0.32 [2025-11-06 18:41:33,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.53 | bwd_microstep: 2161.00 | bwd_inner_microstep: 1.37 | bwd_allreduce_microstep: 2159.48 | step_microstep: 2.50 [2025-11-06 18:41:33,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.65 | bwd: 2162.29 | bwd_inner: 2.53 | bwd_allreduce: 2159.56 | step: 2.64 66%|██████▌ | 2309/3507 [56:47<32:57, 1.65s/it] {'loss': 0.1587, 'learning_rate': 5.523170850673772e-06, 'epoch': 0.66} 66%|██████▌ | 2309/3507 [56:47<32:57, 1.65s/it]tensor([[-6.3125, -6.0000, -1.9375, 1.7109, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6875, -2.6562, 0.2148, -0.0625, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:41:33,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.17 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-6.5000, -3.1719, 1.5938, -0.7422, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9375, -5.0000, -0.2988, 2.6406, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4062, -4.5625, -1.3516, 2.7188, -1.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7188, -2.4219, 1.4062, 0.8438, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') 
tensor([[-3.6250, -4.3438, -2.1719, 2.4062, -0.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.5000, -6.4375, -1.1406, 1.9062, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:41:34,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:41:34,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.45 | bwd_microstep: 2.00 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.87 | step_microstep: 1.62 [2025-11-06 18:41:34,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 440.64 | bwd: 2.98 | bwd_inner: 1.92 | bwd_allreduce: 0.92 | step: 1.72 66%|██████▌ | 2310/3507 [56:47<25:56, 1.30s/it] {'loss': 0.2793, 'learning_rate': 5.514912872953746e-06, 'epoch': 0.66} 66%|██████▌ | 2310/3507 [56:47<25:56, 1.30s/it]tensor([[-3.3438, -3.2031, 0.0101, 3.5156, -1.1016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1875, -4.3125, 0.1494, 3.0625, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6562, -3.8750, 0.4707, 3.4844, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:41:34,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.23 | bwd_microstep: 1.14 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-7.2812, -5.9062, -0.2637, 2.1719, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6562, -4.7188, -0.0415, 2.7188, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2500, -4.0000, -0.6250, 2.7031, -1.8359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1875, -4.1562, 0.5977, 0.9766, 
-4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -2.5781, 1.3047, 0.6914, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:41:36,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.26 | optimizer_step: 0.28 [2025-11-06 18:41:36,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.74 | bwd_microstep: 1670.63 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 1669.37 | step_microstep: 3.16 [2025-11-06 18:41:36,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.00 | bwd: 1671.78 | bwd_inner: 2.22 | bwd_allreduce: 1669.42 | step: 3.24 66%|██████▌ | 2311/3507 [56:50<30:45, 1.54s/it] {'loss': 0.2132, 'learning_rate': 5.5066587222237845e-06, 'epoch': 0.66} 66%|██████▌ | 2311/3507 [56:50<30:45, 1.54s/it]tensor([[-3.6406, -2.7188, 0.4648, 2.5156, -1.8359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-5.3125, -2.6406, 1.9922, 0.9453, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)tensor([3], device='cuda:1') tensor([2], device='cuda:2') tensor([[4.8438, 5.2812, 6.7188, 8.4375, 4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9375, -2.5938, 1.4688, -0.8516, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:41:36,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 269.86 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.22 tensor([[-4.6875, -4.1875, -0.0078, 3.3125, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1250, -2.6719, 0.8203, 0.0320, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.8750, -4.5000, 0.1943, -1.7656, -6.3750]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.9062, 2.1094, 2.2812, -1.4766, -1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:41:36,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.13 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:41:36,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.87 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.83 | step_microstep: 3.67 [2025-11-06 18:41:36,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 500.79 | bwd: 2.94 | bwd_inner: 1.87 | bwd_allreduce: 0.87 | step: 3.89 66%|██████▌ | 2312/3507 [56:50<24:53, 1.25s/it] {'loss': 0.3549, 'learning_rate': 5.498408405526905e-06, 'epoch': 0.66} 66%|██████▌ | 2312/3507 [56:50<24:53, 1.25s/it]tensor([[-2.8750, 0.4238, 2.5469, -0.2461, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5312, -2.8750, 1.6562, 2.8906, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -5.8125, -2.9844, 1.8125, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5625, -4.5000, -0.1865, 2.0781, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0625, -2.4219, 2.6875, -0.2363, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.3438, -7.4375, -3.0469, 1.7812, -3.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7500, -4.7500, -0.6367, 1.9922, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:41:41,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.68 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | 
step_microstep: 0.08 tensor([[-3.6719, -2.7812, 0.3867, 1.9531, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:41:42,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:41:42,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 437.99 | bwd_microstep: 1.93 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.55 [2025-11-06 18:41:42,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 628.69 | bwd: 2.93 | bwd_inner: 1.96 | bwd_allreduce: 0.85 | step: 2.64 66%|██████▌ | 2313/3507 [56:55<49:05, 2.47s/it] {'loss': 0.5956, 'learning_rate': 5.490161929902853e-06, 'epoch': 0.66} 66%|██████▌ | 2313/3507 [56:55<49:05, 2.47s/it]tensor([[-5.3438, -4.3125, -0.1875, 1.9844, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:41:42,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 122.16 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-6.1562, -5.0000, -0.2461, 2.0312, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5000, -1.3672, 3.7031, -0.4648, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-9.5625, -8.3125, -2.6875, 0.2520, -6.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -1.8203, 2.3750, 0.0408, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.4062, -5.1562, 0.7578, 1.5312, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1562, -3.6250, -1.9531, 1.7422, -0.9141]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-3.1875, 
0.9570, 3.0938, -2.3594, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:41:42,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:41:42,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.22 | bwd_microstep: 68.84 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 67.70 | step_microstep: 1.41 [2025-11-06 18:41:42,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.40 | bwd: 69.67 | bwd_inner: 1.82 | bwd_allreduce: 67.74 | step: 1.48 66%|██████▌ | 2314/3507 [56:56<36:53, 1.86s/it] {'loss': 0.8393, 'learning_rate': 5.481919302388108e-06, 'epoch': 0.66} 66%|██████▌ | 2314/3507 [56:56<36:53, 1.86s/it]tensor([[-5.9688, -2.4844, 2.5156, -0.0967, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2188, -5.6250, -1.9219, 2.9375, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.0000, -6.0000, -1.7734, 1.1016, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:41:42,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.34 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.0938, -0.2412, 4.0000, -2.1875, -5.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.4297, 1.7812, 2.3906, -1.2656, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-3.6875, -4.0938, -1.9688, 2.1250, -1.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1562, -3.5781, -0.4922, 4.0938, -0.5977]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.6250, -3.2344, 2.2812, 0.1943, -5.3438]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:41:43,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:41:43,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.15 | bwd_microstep: 166.26 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 164.97 | step_microstep: 1.53 [2025-11-06 18:41:43,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.51 | bwd: 167.14 | bwd_inner: 2.02 | bwd_allreduce: 165.00 | step: 1.60 66%|██████▌ | 2315/3507 [56:56<28:59, 1.46s/it] {'loss': 0.1807, 'learning_rate': 5.4736805300158455e-06, 'epoch': 0.66} 66%|██████▌ | 2315/3507 [56:56<28:59, 1.46s/it]tensor([[-3.1094, 0.2256, 2.7969, -0.2207, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7500, -4.3125, -2.3281, 1.8203, -1.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[3.8594, 5.2812, 6.5000, 6.4375, 3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:41:43,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.44 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-6.6250, -5.9062, -0.3652, 3.5000, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.3750, 1.1641, 2.4688, -1.3984, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2656, 1.0859, 3.3750, -2.1094, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.0625, -5.2500, -0.5508, 0.4590, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2031, -3.4375, -1.2500, 2.5000, -0.9531]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:41:43,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:41:43,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.91 | bwd_microstep: 1.93 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 0.62 | step_microstep: 1.35 [2025-11-06 18:41:43,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.38 | bwd: 2.73 | bwd_inner: 1.97 | bwd_allreduce: 0.64 | step: 1.42 66%|██████▌ | 2316/3507 [56:57<22:45, 1.15s/it] {'loss': 0.3536, 'learning_rate': 5.465445619815965e-06, 'epoch': 0.66} 66%|██████▌ | 2316/3507 [56:57<22:45, 1.15s/it]tensor([[-5.7188, -4.9062, -0.2246, 2.7500, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -3.3281, 0.0396, 1.0156, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3438, -1.1562, 2.1719, -2.4844, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2500, 0.3398, 4.1250, -1.0938, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0938, -5.1250, -0.7695, 1.8828, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, 0.0903, 3.0938, -2.3125, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1875, -3.9062, -0.7383, 2.4219, -1.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:41:45,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.26 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.0938, -4.5312, 0.9180, 0.7227, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') 
[2025-11-06 18:41:46,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.73 | optimizer_gradients: 0.24 | optimizer_step: 0.21 [2025-11-06 18:41:46,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.73 | bwd_microstep: 2.54 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 1.09 | step_microstep: 3.01 [2025-11-06 18:41:46,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 475.02 | bwd: 3.61 | bwd_inner: 2.30 | bwd_allreduce: 1.13 | step: 3.09 66%|██████▌ | 2317/3507 [57:00<32:40, 1.65s/it] {'loss': 0.1575, 'learning_rate': 5.457214578815068e-06, 'epoch': 0.66} 66%|██████▌ | 2317/3507 [57:00<32:40, 1.65s/it]tensor([[-4.4688, -4.9688, -2.2031, 2.2188, -1.6641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:41:46,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.82 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.0469, 2.0938, 2.7812, -2.8594, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4375, -3.6094, 1.1250, 2.0469, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.6719, -0.1377, 3.0156, 3.5625, -0.6797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -4.9375, -1.3672, 2.9844, -1.9609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0938, 1.3828, 4.3125, 3.3281, -0.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5938, -2.3438, 2.6094, 0.7148, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1250, -3.8750, -0.2539, 3.6094, -1.5703]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:41:46,881] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:41:46,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.64 | bwd_microstep: 180.27 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 179.09 | step_microstep: 1.98 [2025-11-06 18:41:46,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.49 | bwd: 181.11 | bwd_inner: 1.85 | bwd_allreduce: 179.13 | step: 2.05 66%|██████▌ | 2318/3507 [57:00<26:04, 1.32s/it] {'loss': 0.4684, 'learning_rate': 5.448987414036457e-06, 'epoch': 0.66} 66%|██████▌ | 2318/3507 [57:00<26:04, 1.32s/it]tensor([[-5.3750, -2.0312, 2.3594, -0.0869, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:41:47,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 120.64 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.5000, -4.4062, 1.6016, 0.6211, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5625, -2.5625, 2.1406, 0.6562, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.0000, -1.8516, 0.7812, 3.9062, -0.1084]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.2734, 2.7031, 3.1719, -2.3594, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.1406, 0.2334, 1.6172, 0.2832, -1.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -2.4219, 1.8828, 2.2031, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6562, -3.1719, 1.5859, 3.3125, -2.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:41:48,353] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.27
[2025-11-06 18:41:48,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.30 | bwd_microstep: 925.10 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 923.96 | step_microstep: 2.09
[2025-11-06 18:41:48,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.95 | bwd: 925.83 | bwd_inner: 1.67 | bwd_allreduce: 924.01 | step: 2.17
66%|██████▌ | 2319/3507 [57:02<26:58, 1.36s/it] {'loss': 0.5111, 'learning_rate': 5.440764132500125e-06, 'epoch': 0.66}
tensor([[-4.4062, -3.7969, -0.6289, 1.9844, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:48,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.54 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-7.3438, -6.4375, -1.6953, 1.6016, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8281, 0.4883, 2.5625, -2.8125, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3125, -1.2578, 3.0156, -1.0000, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.9062, -5.7500, -1.0469, 1.4531, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8125, -3.7500, 0.0645, 1.9922, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1562, -5.1875, -1.5703, 2.7812, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7188, -3.5000, 0.7305, 2.8906, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:41:50,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.17 | optimizer_step: 0.21
[2025-11-06 18:41:50,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.94 | bwd_microstep: 1516.78 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 1515.54 | step_microstep: 2.25
[2025-11-06 18:41:50,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.50 | bwd: 1517.78 | bwd_inner: 2.06 | bwd_allreduce: 1515.60 | step: 2.34
66%|██████▌ | 2320/3507 [57:04<29:50, 1.51s/it] {'loss': 0.3227, 'learning_rate': 5.43254474122275e-06, 'epoch': 0.66}
tensor([[-3.8750, -0.2393, 2.4531, -1.5000, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.0625, -7.1250, -3.0625, 1.2734, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.8594, -1.2266, 2.2344, 1.1484, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7188, -2.4062, 1.5938, 3.3125, -1.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:50,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.81 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-2.4531, -0.1445, 1.3047, -0.3984, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.7188, -3.3438, 1.7344, 3.6562, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0312, -3.8438, 0.2539, 2.1719, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[2.4531, 4.0312, 5.5938, 5.3750, 2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:41:50,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.16
[2025-11-06 18:41:50,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.43 | bwd_microstep: 35.90 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 34.69 | step_microstep: 1.77
[2025-11-06 18:41:50,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.26 | bwd: 36.97 | bwd_inner: 2.09 | bwd_allreduce: 34.73 | step: 1.85
66%|██████▌ | 2321/3507 [57:04<23:30, 1.19s/it] {'loss': 0.6993, 'learning_rate': 5.424329247217688e-06, 'epoch': 0.66}
tensor([[-6.5312, -4.3438, 1.1094, 1.6328, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.9375, -1.4375, 3.6250, -1.2422, -5.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:41:50,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.38 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.13
tensor([[-8.0000, -6.7500, -2.1094, 0.1147, -5.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0625, -4.2500, 0.3262, 3.4531, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.5312, -3.1875, 0.9102, 0.7500, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0625, -3.5156, 0.5859, 1.8125, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1250, -1.0234, 2.3594, -0.4492, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3438, -0.4590, 3.0781, -0.8203, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:41:52,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:41:52,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.15 | bwd_microstep: 1227.24 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 1225.98 | step_microstep: 1.68
[2025-11-06 18:41:52,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.56 | bwd: 1228.30 | bwd_inner: 2.07 | bwd_allreduce: 1226.03 | step: 1.81
66%|██████▌ | 2322/3507 [57:06<25:52, 1.31s/it] {'loss': 0.2296, 'learning_rate': 5.416117657494977e-06, 'epoch': 0.66}
tensor([[-5.3438, -2.8594, 2.5312, 2.3594, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0000, -4.6250, -1.0391, 2.3906, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.4688, -4.4062, 0.0737, 0.5273, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:52,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.23 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.6875, -3.6875, 0.3789, 0.6602, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.5312, -4.7812, -0.0084, 1.0703, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.7383, 2.6250, 2.8906, -1.4062, -1.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.1250, -4.5625, 0.6562, 0.5742, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.9688, -4.9062, -0.4785, 4.0000, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:41:54,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:41:54,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.21 | bwd_microstep: 1586.05 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 1584.76 | step_microstep: 2.07
[2025-11-06 18:41:54,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.48 | bwd: 1586.96 | bwd_inner: 2.02 | bwd_allreduce: 1584.79 | step: 2.14
66%|██████▌ | 2323/3507 [57:08<30:53, 1.57s/it] {'loss': 0.6036, 'learning_rate': 5.407909979061319e-06, 'epoch': 0.66}
tensor([[-2.2656, 1.6562, 3.9219, -0.9570, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.7500, -4.4688, -0.5352, 3.3438, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7500, -5.1875, -0.9453, 2.5312, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:54,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.96 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.6875, -1.9922, 1.8281, 0.6914, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6250, -2.4531, 1.2734, 0.7930, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.8438, -3.1250, 2.8438, 0.1699, -5.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.3438, -6.1875, -2.0625, 2.0938, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.4062, -4.0000, 1.6328, -0.2871, -5.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:41:54,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:41:54,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 334.60 | bwd_microstep: 1.79 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.77 | step_microstep: 1.45
[2025-11-06 18:41:54,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 539.57 | bwd: 2.72 | bwd_inner: 1.77 | bwd_allreduce: 0.81 | step: 1.53
66%|██████▋ | 2324/3507 [57:08<25:04, 1.27s/it] {'loss': 0.2103, 'learning_rate': 5.399706218920078e-06, 'epoch': 0.66}
tensor([[-4.8438, -1.5625, 1.4844, -1.3594, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.5625, -3.4375, -1.3672, 1.3203, -1.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:55,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.66 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.2500, -3.3906, 1.0703, 2.1875, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0625, -4.1250, -0.3125, 1.9531, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.9688, 0.7461, 4.1250, -1.7969, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2812, -2.7656, 1.8047, 1.2344, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.1719, -2.5000, 0.0693, 1.8516, -1.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-5.5000, -3.7656, 1.0703, 2.2500, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:41:56,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.19 | optimizer_step: 0.29
[2025-11-06 18:41:56,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.20 | bwd_microstep: 1158.01 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1156.92 | step_microstep: 2.10
[2025-11-06 18:41:56,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.89 | bwd: 1158.94 | bwd_inner: 1.84 | bwd_allreduce: 1156.97 | step: 2.19
66%|██████▋ | 2325/3507 [57:10<26:39, 1.35s/it] {'loss': 0.9135, 'learning_rate': 5.391506384071278e-06, 'epoch': 0.66}
tensor([[-3.8906, -4.2812, -1.6172, 2.7969, -1.2109]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5312, -4.2812, -0.0508, 1.7266, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-8.3125, -5.4688, 0.6523, 0.3691, -6.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:41:56,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.03 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-6.4375, -6.4688, -2.7656, 1.5391, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9375, -5.0312, -1.6484, 2.5625, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-9.3125, -7.4688, -1.2422, 0.6523, -6.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.3438, -1.0938, 1.5312, 0.6836, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.1875, -2.8125, 0.7266, -1.8828, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:41:57,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.19 | optimizer_step: 0.17
[2025-11-06 18:41:57,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.98 | bwd_microstep: 385.09 | bwd_inner_microstep: 2.08 | bwd_allreduce_microstep: 382.92 | step_microstep: 1.93
[2025-11-06 18:41:57,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 345.04 | bwd: 385.92 | bwd_inner: 2.83 | bwd_allreduce: 382.95 | step: 2.01
66%|██████▋ | 2326/3507 [57:11<23:10, 1.18s/it] {'loss': 0.2192, 'learning_rate': 5.38331048151159e-06, 'epoch': 0.66}
tensor([[-5.5312, -4.7812, -0.8008, 1.6406, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.2734, -0.3164, 2.6250, 6.2500, 1.4453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0000, -2.8281, 1.0547, 0.8672, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:41:57,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.78 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.9062, -2.8125, 2.5625, 1.1172, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0625, -4.4375, -1.7109, 2.2656, -1.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5781, -3.0938, 0.7578, 4.1875, -1.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6250, -0.6328, 3.5781, -0.5078, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5938, -1.0312, 2.7344, -0.4590, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:41:57,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.20 | optimizer_step: 0.18
[2025-11-06 18:41:57,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.81 | bwd_microstep: 1.41 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.63 | step_microstep: 8.45
[2025-11-06 18:41:57,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.62 | bwd: 2.27 | bwd_inner: 1.47 | bwd_allreduce: 0.66 | step: 8.53
66%|██████▋ | 2327/3507 [57:11<18:46, 1.05it/s] {'loss': 0.143, 'learning_rate': 5.3751185182343326e-06, 'epoch': 0.66}
tensor([[-2.7031, 1.1172, 3.7188, -0.4336, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.7812, -5.9062, 0.3320, 2.0156, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-10.1250, -8.1875, -1.6875, 0.1245, -6.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.1562, -1.6953, 1.2344, 1.8750, -1.8828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1250, -1.4062, 2.7500, -0.5859, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.1250, -5.4688, -0.4375, 0.9102, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7656, 0.0913, 3.0312, -1.2031, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:41:59,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.28 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.13
tensor([[-3.4062, -3.4062, -0.1748, 3.6094, -1.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:00,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:42:00,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.58 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.86
[2025-11-06 18:42:00,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.87 | bwd: 3.13 | bwd_inner: 2.02 | bwd_allreduce: 0.94 | step: 2.99
66%|██████▋ | 2328/3507 [57:14<27:35, 1.40s/it] {'loss': 0.3265, 'learning_rate': 5.366930501229459e-06, 'epoch': 0.66}
tensor([[-3.2969, 1.4844, 4.3750, -2.1719, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:00,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.76 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.9219, -4.0312, -0.7734, 3.4062, -1.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4375, -2.7812, 0.8672, 1.3750, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.0625, 0.8203, 4.0625, 1.8516, -1.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.6250, -5.0000, 1.2578, 1.5391, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.9375, -3.8906, 1.6016, 2.4062, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.4062, -2.7188, 2.5000, 1.9375, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.0000, -3.9844, 2.0000, 0.6641, -5.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:42:00,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:42:00,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.59 | bwd_microstep: 158.66 | bwd_inner_microstep: 1.56 | bwd_allreduce_microstep: 157.03 | step_microstep: 1.75
[2025-11-06 18:42:00,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 384.38 | bwd: 159.55 | bwd_inner: 2.34 | bwd_allreduce: 157.07 | step: 1.84
66%|██████▋ | 2329/3507 [57:14<22:44, 1.16s/it] {'loss': 0.4615, 'learning_rate': 5.35874643748356e-06, 'epoch': 0.66}
tensor([[-4.7500, -0.1572, 3.3906, -2.1250, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.2812, -5.8125, -0.0535, 2.1094, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.9062, -5.6250, -0.8477, 1.3828, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5625, 0.1855, 1.7891, -2.1875, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.6875, -1.5547, 3.4844, -0.6094, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.2812, -2.9531, 0.4473, 1.5312, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6562, -2.7031, 0.3164, 2.1094, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:03,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.14 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.7188, -3.7812, 0.6562, 3.1562, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:04,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:42:04,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.43 | bwd_microstep: 1.88 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.75 | step_microstep: 2.10
[2025-11-06 18:42:04,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.60 | bwd: 2.59 | bwd_inner: 1.69 | bwd_allreduce: 0.78 | step: 2.17
66%|██████▋ | 2330/3507 [57:17<35:56, 1.83s/it] {'loss': 0.1319, 'learning_rate': 5.350566333979852e-06, 'epoch': 0.66}
tensor([[-4.4062, -2.4688, 1.3906, 1.6250, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.6875, -1.5703, 3.4531, -0.3848, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:04,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.89 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.3125, -6.5625, -3.1094, 1.5859, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.8750, -4.1875, 1.5156, 0.9062, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.5312, -4.3750, 1.5234, 0.0266, -5.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.5625, -6.5938, -3.0625, -1.2422, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.7812, 0.9062, 2.9844, -1.3984, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.5625, -4.2500, 1.6094, 2.0469, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:04,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:42:04,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 319.52 | bwd_microstep: 1.89 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.85 | step_microstep: 1.70
[2025-11-06 18:42:04,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 500.43 | bwd: 2.72 | bwd_inner: 1.71 | bwd_allreduce: 0.89 | step: 1.79
66%|██████▋ | 2331/3507 [57:18<28:20, 1.45s/it] {'loss': 0.3459, 'learning_rate': 5.342390197698178e-06, 'epoch': 0.66}
tensor([[-4.9375, -4.2188, 0.3848, 3.6562, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0312, -4.9375, -1.1328, 2.8594, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4062, -0.6172, 3.1406, -0.6133, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7656, -0.1455, 2.6562, -1.2422, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1875, -3.0312, 0.7656, 2.7500, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5312, -1.2734, 3.2500, 1.0234, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0000, -2.0312, 2.7031, -0.8086, -5.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:06,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.75 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-3.4062, -4.0938, -1.7188, 2.8281, -0.8320]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:07,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:42:07,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.87 | bwd_microstep: 1.89 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.28
[2025-11-06 18:42:07,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 495.66 | bwd: 2.73 | bwd_inner: 1.72 | bwd_allreduce: 0.87 | step: 2.37
66%|██████▋ | 2332/3507 [57:20<33:37, 1.72s/it] {'loss': 0.0652, 'learning_rate': 5.3342180356149756e-06, 'epoch': 0.66}
tensor([[-9.5000, -8.7500, -2.6406, 1.4844, -5.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2500, 0.7422, 3.5781, -2.9531, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:07,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.33 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.2500, -5.2812, -1.2031, 3.2656, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.9688, -4.2812, 1.9844, 1.8125, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.1250, -3.6406, 0.5938, 0.0157, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6250, -0.2793, 3.2188, -2.2344, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.1562, -1.4609, 1.8438, 0.3223, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9375, -4.3750, -2.2188, 1.7578, -1.3828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:42:07,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:42:07,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.30 | bwd_microstep: 50.51 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 49.39 | step_microstep: 1.87
[2025-11-06 18:42:07,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.65 | bwd: 51.52 | bwd_inner: 1.96 | bwd_allreduce: 49.43 | step: 1.97
67%|██████▋ | 2333/3507 [57:21<26:22, 1.35s/it] {'loss': 0.2086, 'learning_rate': 5.32604985470331e-06, 'epoch': 0.67}
tensor([[-5.2188, -2.6719, 1.2812, 0.2344, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.3438, -4.5625, -0.2227, 2.6875, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-8.1875, -7.1562, -1.4062, 1.8828, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.6250, -0.1816, 2.7500, 1.4844, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.0156, 1.5156, 4.5000, -1.5156, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.9375, -2.8750, 1.5078, 1.8672, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:08,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.03 | bwd_microstep: 1.21 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-7.7188, -6.3125, -0.3164, 2.2969, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.5000, -4.3438, 1.2500, 1.8984, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:42:08,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:42:08,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.02 | bwd_microstep: 319.80 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 318.77 | step_microstep: 1.99
[2025-11-06 18:42:08,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.08 | bwd: 321.01 | bwd_inner: 2.06 | bwd_allreduce: 318.82 | step: 2.07
67%|██████▋ | 2334/3507 [57:22<25:58, 1.33s/it] {'loss': 0.351, 'learning_rate': 5.31788566193285e-06, 'epoch': 0.67}
tensor([[-5.8125, -6.0938, -2.2656, 2.3281, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0000, -3.0781, 1.2188, -0.8750, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:09,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.67 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.6250, 0.1377, 1.4609, -3.3594, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:1')
tensor([[-8.0000, -7.2500, -2.5469, 0.7422, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5625, -3.2812, 1.5078, 1.5781, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5625, -1.2656, 3.6406, -0.9531, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-10.4375, -6.3750, -0.3105, -3.0312, -8.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.9375, -3.7500, 2.3125, 0.7617, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:09,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:42:09,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.83 | bwd_microstep: 1.92 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.11
[2025-11-06 18:42:09,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 442.53 | bwd: 2.85 | bwd_inner: 1.85 | bwd_allreduce: 0.85 | step: 2.18
67%|██████▋ | 2335/3507 [57:23<21:01, 1.08s/it] {'loss': 0.8181, 'learning_rate': 5.309725464269852e-06, 'epoch': 0.67}
tensor([[-5.3125, -4.1250, 0.3594, 2.5156, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.9844, -4.2812, -1.1250, 3.1719, -1.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:09,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 209.01 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-2.6250, 0.6914, 2.7812, -0.4902, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5312, -1.3906, 2.4219, 0.1602, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4062, -4.0000, -0.6875, 2.2656, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.3750, -0.2930, 2.2500, -2.2969, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.3750, -4.1562, 0.0160, 1.8906, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.7812, -0.9727, 1.7500, 1.0078, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:42:13,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.32 | optimizer_step: 0.43
[2025-11-06 18:42:13,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 93.17 | bwd_microstep: 2118.37 | bwd_inner_microstep: 10.70 | bwd_allreduce_microstep: 2107.55 | step_microstep: 3.54
[2025-11-06 18:42:13,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.20 | bwd: 2119.04 | bwd_inner: 11.28 | bwd_allreduce: 2107.61 | step: 3.64
67%|██████▋ | 2336/3507 [57:26<36:27, 1.87s/it] {'loss': 1.0106, 'learning_rate': 5.3015692686771725e-06, 'epoch': 0.67}
tensor([[-5.3438, -5.0312, -1.0781, 2.6094, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8125, -1.0391, 3.0312, -0.5781, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6250, -3.8750, 0.1455, 2.9062, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.7891, 0.3047, 2.4375, 1.2578, -1.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.8125, -5.9375, -0.1572, 1.1406, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.5938, -4.0938, 0.2295, -0.3594, -4.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.2578e+00, 1.5564e-03, 2.6406e+00, 3.8594e+00, -1.4551e-01]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:15,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.43 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.4531, 1.1406, 3.6250, -0.5820, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:16,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.31 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:42:16,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 365.61 | bwd_microstep: 2.98 | bwd_inner_microstep: 1.50 | bwd_allreduce_microstep: 1.34 | step_microstep: 3.17
[2025-11-06 18:42:16,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 546.08 | bwd: 4.08 | bwd_inner: 2.51 | bwd_allreduce: 1.39 | step: 3.24
67%|██████▋ | 2337/3507 [57:29<43:40, 2.24s/it] {'loss': 0.2015, 'learning_rate': 5.293417082114235e-06, 'epoch': 0.67}
tensor([[-5.7188, -4.6562, -0.2236, 2.2188, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8750, -2.7031, 0.7266, 4.3750, -0.6328]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-9.1875, -6.6562, -1.5312, -1.6328, -6.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5938, -4.9375, -1.7109, 2.4688, -1.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:16,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.35 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12
tensor([[-5.5625, -4.0312, 0.5234, 1.7578, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.4531, -3.0938, -1.5781, 2.6406, -0.1592]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-6.7500, -4.5625, 0.6836, 1.1484, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.1250, -4.6875, 0.9023, 2.9219, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:16,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.37 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:42:16,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 279.46 | bwd_microstep: 1.82 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.77 | step_microstep: 3.75
[2025-11-06 18:42:16,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 481.85 | bwd: 2.79 | bwd_inner: 1.81 | bwd_allreduce: 0.82 | step: 3.86
67%|██████▋ | 2338/3507 [57:30<33:41, 1.73s/it] {'loss': 1.0468, 'learning_rate': 5.2852689115370685e-06, 'epoch': 0.67}
tensor([[-3.3750, -3.1719, -1.3125, 1.3906, -1.3828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0938, -4.4375, -0.7812, 4.0312, -1.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.1250, -1.6797, 1.2031, 1.9531, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0000, -4.4062, -0.5117, 0.5039, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.8750, -3.7344, 0.6758, 3.0938, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0000, -1.8672, 2.0156, -0.1118, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5938, -3.6094, 0.9609, 1.5625, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:18,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.57 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-6.6250, -3.1562, 2.3281, 0.1621, -5.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:18,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.72 | optimizer_gradients: 0.18 | optimizer_step: 0.20
[2025-11-06 18:42:18,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.52 | bwd_microstep: 1.79 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 25.01
[2025-11-06 18:42:18,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 457.09 | bwd: 2.64 | bwd_inner: 1.72 | bwd_allreduce: 0.79 | step: 25.08
67%|██████▋ | 2339/3507 [57:32<33:05, 1.70s/it] {'loss': 0.4765, 'learning_rate': 5.2771247638982556e-06, 'epoch': 0.67}
tensor([[-5.3438, -1.3828,
2.8594, -1.0000, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9375, -3.8750, 0.8984, 1.3047, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7500, -2.0469, 0.6445, 0.6328, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7188, -5.1562, -1.6406, 3.0781, -1.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:42:18,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.11 | bwd_microstep: 1.11 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.4375, -1.6250, 2.5938, -1.2500, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -2.9375, 1.3828, 3.1094, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.8125, -3.9531, 1.2109, -2.1719, -6.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5625, -1.7656, 2.3750, -1.4219, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:42:18,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:42:18,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.55 | bwd_microstep: 40.78 | bwd_inner_microstep: 1.96 | bwd_allreduce_microstep: 38.72 | step_microstep: 2.15 [2025-11-06 18:42:18,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.69 | bwd: 41.88 | bwd_inner: 2.97 | bwd_allreduce: 38.76 | step: 2.24 67%|██████▋ | 2340/3507 [57:32<25:44, 1.32s/it] {'loss': 0.252, 'learning_rate': 5.268984646146957e-06, 'epoch': 0.67} 67%|██████▋ | 2340/3507 [57:32<25:44, 1.32s/it]tensor([[-4.1875, -4.4688, -1.3438, 2.8438, -1.5234]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3125, 0.4961, 2.5156, -2.2344, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5000, -4.9688, -1.8047, 2.8594, -1.6641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1875, -2.1250, 2.8750, -1.1953, -5.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6250, -1.6484, 3.1875, -0.9219, -5.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.7188, -5.9062, 0.1826, 1.8203, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.6719, -2.5156, -2.2344, 1.1484, 0.1777]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:42:20,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.35 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.8125, -2.8125, 1.7578, 2.4375, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:42:20,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:42:20,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.91 | bwd_microstep: 3.38 | bwd_inner_microstep: 2.28 | bwd_allreduce_microstep: 0.93 | step_microstep: 2.55 [2025-11-06 18:42:20,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 304.30 | bwd: 4.13 | bwd_inner: 2.96 | bwd_allreduce: 0.96 | step: 2.63 67%|██████▋ | 2341/3507 [57:34<28:09, 1.45s/it] {'loss': 0.2756, 'learning_rate': 5.260848565228882e-06, 'epoch': 0.67} 67%|██████▋ | 2341/3507 [57:34<28:09, 1.45s/it]tensor([[-6.0312, -3.0781, 1.0234, -0.8398, -4.9375]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:3') tensor([[-7.6250, -4.0312, 1.1094, -1.6641, -6.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9375, -2.9375, 2.3906, 0.8047, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:42:20,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 233.22 | bwd_microstep: 1.30 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.13 tensor([[-4.2188, -3.5312, -0.0977, 2.4844, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -4.0000, 0.5000, 2.9375, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9219, 1.3359, 3.6719, -2.1562, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-2.5781, 1.8047, 3.6406, -2.0469, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3750, -3.5781, -1.4297, 2.3594, -0.9961]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') [2025-11-06 18:42:21,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:42:21,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.02 | bwd_microstep: 130.81 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 129.74 | step_microstep: 2.88 [2025-11-06 18:42:21,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.28 | bwd: 132.10 | bwd_inner: 2.10 | bwd_allreduce: 129.80 | step: 3.01 67%|██████▋ | 2342/3507 [57:34<22:53, 1.18s/it] {'loss': 1.0596, 'learning_rate': 5.252716528086319e-06, 'epoch': 0.67} 67%|██████▋ | 2342/3507 [57:34<22:53, 1.18s/it]tensor([[-3.7188, -0.9883, 3.0312, 1.6250, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:1') tensor([[-3.8281, -1.4375, 1.4922, 0.0258, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1562, -2.4219, -1.0625, 2.2500, -0.2559]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.2188, -4.0938, -0.8086, 2.5312, -1.8203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.6562, -3.8906, 0.6250, -0.4551, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5625, -2.0781, 1.3125, 4.1250, -0.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7500, -5.0000, 0.0635, 3.3125, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:42:22,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.31 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.1875, -1.9453, 3.3438, 1.5625, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:42:22,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.50 | optimizer_step: 0.38 [2025-11-06 18:42:22,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.70 | bwd_microstep: 1.79 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.83 | step_microstep: 3.87 [2025-11-06 18:42:22,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.03 | bwd: 2.83 | bwd_inner: 1.80 | bwd_allreduce: 0.87 | step: 3.97 67%|██████▋ | 2343/3507 [57:36<24:02, 1.24s/it] {'loss': 1.6129, 'learning_rate': 5.244588541658078e-06, 'epoch': 0.67} 67%|██████▋ | 2343/3507 [57:36<24:02, 1.24s/it]tensor([[-2.0625, -2.6719, -1.2812, 2.4219, 0.0223]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9375, -1.1562, 
2.5469, -1.4922, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.5625, -4.3438, 1.0625, -0.7578, -6.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3750, -1.0703, 2.6094, 2.1875, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.5625, -4.9375, -0.5859, -1.5781, -5.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:42:23,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.57 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.7188, -3.6875, 0.5312, 0.8711, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8750, -3.3594, -1.0078, 1.4688, -1.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -1.3359, 2.5781, -1.1953, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:42:23,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.23 | optimizer_step: 0.23 [2025-11-06 18:42:23,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.42 | bwd_microstep: 486.10 | bwd_inner_microstep: 1.30 | bwd_allreduce_microstep: 484.69 | step_microstep: 2.18 [2025-11-06 18:42:23,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.01 | bwd: 487.03 | bwd_inner: 2.11 | bwd_allreduce: 484.74 | step: 2.27 67%|██████▋ | 2344/3507 [57:37<24:22, 1.26s/it] {'loss': 0.2892, 'learning_rate': 5.236464612879529e-06, 'epoch': 0.67} 67%|██████▋ | 2344/3507 [57:37<24:22, 1.26s/it]tensor([[5.3750, 7.1875, 7.4062, 5.7188, 4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.2188, -5.8750, -1.2812, 2.6250, -3.2344]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:42:23,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.57 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.1250, -3.2656, 1.1953, 1.9375, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6562, -2.6875, 0.5039, -1.5938, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.9375, -3.5781, -1.9766, 2.0000, -0.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.4688, -4.8438, -2.1250, 1.9375, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3750, -4.1875, 0.5508, 3.0938, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9062, -1.9766, 3.6250, 0.1602, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:42:24,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.27 [2025-11-06 18:42:24,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 78.55 | bwd_microstep: 578.75 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 577.77 | step_microstep: 2.10 [2025-11-06 18:42:24,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 245.14 | bwd: 579.71 | bwd_inner: 1.76 | bwd_allreduce: 577.83 | step: 2.19 67%|██████▋ | 2345/3507 [57:38<22:02, 1.14s/it] {'loss': 0.9569, 'learning_rate': 5.228344748682574e-06, 'epoch': 0.67} 67%|██████▋ | 2345/3507 [57:38<22:02, 1.14s/it]tensor([[-3.0625, -2.6250, 1.4766, 5.0625, -0.7461]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4375, -3.9531, -0.7148, 2.4375, -2.0625]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:42:24,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.03 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0312, -3.6250, -0.4688, 2.3438, -1.8984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.7227, 2.7500, 2.7812, -1.9141, -1.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.5000, -2.1875, 1.5703, 0.8242, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7500, 1.1797, 3.0469, -1.7969, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.1719, 1.6797, 3.9219, -0.8242, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -0.5664, 2.9062, -0.6562, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:42:26,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 18:42:26,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.46 | bwd_microstep: 1260.32 | bwd_inner_microstep: 1.36 | bwd_allreduce_microstep: 1258.87 | step_microstep: 2.37 [2025-11-06 18:42:26,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 304.50 | bwd: 1261.23 | bwd_inner: 2.19 | bwd_allreduce: 1258.91 | step: 2.45 67%|██████▋ | 2346/3507 [57:40<24:42, 1.28s/it] {'loss': 0.204, 'learning_rate': 5.220228955995654e-06, 'epoch': 0.67} 67%|██████▋ | 2346/3507 [57:40<24:42, 1.28s/it]tensor([[-7.3125, -5.9688, -0.9805, 1.0547, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:42:26,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.73 | 
bwd_microstep: 1.79 | bwd_inner_microstep: 1.59 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.15 tensor([[-4.5312, -0.6172, 2.9688, -1.2031, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.2188, -3.0469, -2.5781, 1.0859, -0.1650]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-5.0938, -3.5469, 1.6719, 3.4062, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5938, -3.2656, 0.9648, 2.3906, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3438, -1.3984, 3.5625, -0.1748, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0625, -2.3125, 2.2031, 0.8203, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7812, 0.6836, 4.0000, 0.6641, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:42:27,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.91 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:42:27,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.75 | bwd_microstep: 1.72 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.82 | step_microstep: 4.84 [2025-11-06 18:42:27,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 284.51 | bwd: 3.51 | bwd_inner: 2.44 | bwd_allreduce: 0.88 | step: 5.00 67%|██████▋ | 2347/3507 [57:41<26:04, 1.35s/it] {'loss': 0.9523, 'learning_rate': 5.2121172417437345e-06, 'epoch': 0.67} 67%|██████▋ | 2347/3507 [57:41<26:04, 1.35s/it]tensor([[-1.0859, 2.5938, 4.0312, -0.6445, -1.9453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5312, -2.1562, 1.4609, -1.5469, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') 
tensor([[-4.9375, -4.9062, -1.1797, 2.8906, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -4.5000, -0.9023, 3.5469, -1.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:42:27,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.10 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.6250, -2.9219, 2.0469, 1.3281, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5625, -3.5938, 0.1758, 2.4531, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.6250, -1.7891, 2.8750, -0.4570, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2812, -1.5391, 2.1406, -1.4297, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:42:28,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:42:28,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.96 | bwd_microstep: 689.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 689.02 | step_microstep: 2.96 [2025-11-06 18:42:28,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 396.06 | bwd: 690.56 | bwd_inner: 1.36 | bwd_allreduce: 689.05 | step: 3.05 67%|██████▋ | 2348/3507 [57:42<24:50, 1.29s/it] {'loss': 0.2095, 'learning_rate': 5.204009612848288e-06, 'epoch': 0.67} 67%|██████▋ | 2348/3507 [57:42<24:50, 1.29s/it]tensor([[-6.1250, -5.9062, -1.8984, 1.9922, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:42:29,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.99 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | 
bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.9844, -2.6719, 0.5547, 3.9062, -0.8242]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5000, -4.0000, 0.6602, 2.4219, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.1953, 2.5625, 3.5000, -1.0625, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.7344, -0.1885, 3.0312, -0.5430, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.4062, -6.3125, -1.7891, 0.3574, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2812, -3.3125, 0.9492, 1.3828, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5938, -4.8125, 0.6523, 1.7500, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:42:31,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:42:31,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.62 | bwd_microstep: 1.85 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.72 [2025-11-06 18:42:31,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 422.64 | bwd: 2.82 | bwd_inner: 1.83 | bwd_allreduce: 0.86 | step: 2.81 67%|██████▋ | 2349/3507 [57:45<32:52, 1.70s/it] {'loss': 0.719, 'learning_rate': 5.19590607622731e-06, 'epoch': 0.67} 67%|██████▋ | 2349/3507 [57:45<32:52, 1.70s/it]tensor([[-1.8984, 1.7812, 2.3125, -2.6250, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2188, -3.2656, -0.0540, 3.6094, -0.9570]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5625, -3.8906, 0.3672, 1.2891, -3.6406]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:42:31,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.37 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.4062, -5.4688, -1.5625, 2.6094, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7812, -4.9375, -1.7109, 2.4219, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -3.5938, -0.0488, 1.2578, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2188, -3.2188, 0.7031, 0.5547, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9375, -2.4531, 1.9297, 3.0469, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:42:37,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.49 | optimizer_step: 0.37 [2025-11-06 18:42:37,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.83 | bwd_microstep: 5357.66 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 5356.61 | step_microstep: 4.94 [2025-11-06 18:42:37,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 436.26 | bwd: 5358.46 | bwd_inner: 1.53 | bwd_allreduce: 5356.73 | step: 5.04 67%|██████▋ | 2350/3507 [57:51<56:48, 2.95s/it] {'loss': 0.378, 'learning_rate': 5.187806638795313e-06, 'epoch': 0.67} 67%|██████▋ | 2350/3507 [57:51<56:48, 2.95s/it]tensor([[-6.0312, -4.3125, -0.3320, 0.6797, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0000, 0.7969, 2.2969, -2.5000, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([[-5.0938, -3.9219, -0.0874, 1.4375, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([3], device='cuda:0') [2025-11-06 18:42:37,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.06 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.8125, -4.0312, -0.6445, -0.5547, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7031, -2.8125, 0.5586, 2.5312, -1.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -3.5156, 0.5703, 1.9609, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.5625, -0.1934, 2.5938, -1.0156, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8750, -1.4844, 2.9375, 0.0864, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:42:38,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.21 | optimizer_step: 0.31 [2025-11-06 18:42:38,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.67 | bwd_microstep: 209.28 | bwd_inner_microstep: 2.05 | bwd_allreduce_microstep: 207.02 | step_microstep: 2.31 [2025-11-06 18:42:38,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.77 | bwd: 210.20 | bwd_inner: 2.93 | bwd_allreduce: 207.05 | step: 2.40 67%|██████▋ | 2351/3507 [57:51<43:25, 2.25s/it] {'loss': 0.3722, 'learning_rate': 5.1797113074633e-06, 'epoch': 0.67} 67%|██████▋ | 2351/3507 [57:51<43:25, 2.25s/it]tensor([[-2.7031, -3.4688, -2.1250, 1.9219, -0.3965]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:42:38,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 102.86 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.6250, -1.2969, 2.2969, 1.3359, 
-2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0312, -4.0938, 0.4102, 2.9844, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5469, 0.5039, 3.6406, -0.8047, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6875, -3.7500, -0.1523, 1.6719, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.7188, -4.2812, 0.1719, 1.8438, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -3.8750, 0.1875, 2.1875, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1562, -3.7188, -1.7266, 2.4531, -0.7539]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') [2025-11-06 18:42:39,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:42:39,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.98 | bwd_microstep: 730.51 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 729.46 | step_microstep: 3.04 [2025-11-06 18:42:39,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 261.87 | bwd: 731.27 | bwd_inner: 1.65 | bwd_allreduce: 729.49 | step: 3.11 67%|██████▋ | 2352/3507 [57:52<36:16, 1.88s/it] {'loss': 1.0227, 'learning_rate': 5.171620089138774e-06, 'epoch': 0.67} 67%|██████▋ | 2352/3507 [57:52<36:16, 1.88s/it]tensor([[-4.3750, -0.2080, 3.7188, -1.1328, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:42:39,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.78 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.2188, -1.1406, 1.6328, 3.0625, -0.8594]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.2637, 3.3281, 3.1562, -1.5703, -1.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-5.1562, -0.8945, 3.4375, -1.1406, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.9375, -1.4219, 1.6406, 0.4492, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.7656, -2.7500, 0.3320, 1.7578, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.3125, -3.6719, 0.1543, 2.9688, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9688, -4.0000, 1.0234, 3.7656, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:42:39,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.83 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:42:39,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.65 | bwd_microstep: 134.17 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 133.16 | step_microstep: 3.70
[2025-11-06 18:42:39,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.45 | bwd: 135.00 | bwd_inner: 1.67 | bwd_allreduce: 133.20 | step: 3.80
67%|██████▋ | 2353/3507 [57:53<28:05, 1.46s/it] {'loss': 0.5596, 'learning_rate': 5.163532990725728e-06, 'epoch': 0.67}
tensor([[-3.2656, -2.8906, -0.4336, 2.0156, -1.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2500, -0.5117, 1.9844, -0.7109, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.9531, -3.6719, -1.5547, 2.9375, -0.3945]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.9062, -2.8906, 2.6562, 1.1250, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:39,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.17 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.6250, -0.7305, 4.2500, 0.4395, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.7188, -1.2656, 3.5469, 0.8008, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5000, -0.3496, 3.6094, -1.0938, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.1250, -4.0938, 1.8672, 0.4395, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:42:41,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:42:41,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.36 | bwd_microstep: 1291.37 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 1290.32 | step_microstep: 2.02
[2025-11-06 18:42:41,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.55 | bwd: 1292.24 | bwd_inner: 1.74 | bwd_allreduce: 1290.36 | step: 2.09
67%|██████▋ | 2354/3507 [57:55<29:26, 1.53s/it] {'loss': 0.4512, 'learning_rate': 5.15545001912464e-06, 'epoch': 0.67}
tensor([[-4.1562, 0.2891, 3.0312, -2.6406, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.2188, -2.8750, 0.5117, 3.5156, -1.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1875, -2.3125, 2.0156, 2.4219, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4062, -3.9375, 0.3496, 3.8906, -1.8672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:41,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.21 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14
tensor([[-0.2295, 3.1094, 2.8906, -1.7422, -1.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.6250, -4.8125, -2.0625, 1.3438, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8906, 1.8750, 3.8438, -2.8281, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.5625, -3.9062, 0.7539, 2.0156, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:42:41,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.20 | optimizer_step: 0.18
[2025-11-06 18:42:41,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.10 | bwd_microstep: 319.18 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 318.08 | step_microstep: 1.83
[2025-11-06 18:42:41,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 421.34 | bwd: 320.23 | bwd_inner: 1.83 | bwd_allreduce: 318.16 | step: 1.98
67%|██████▋ | 2355/3507 [57:55<25:08, 1.31s/it] {'loss': 0.27, 'learning_rate': 5.147371181232468e-06, 'epoch': 0.67}
tensor([[-2.8438, 0.7422, 3.1406, -1.0781, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:42,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.56 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-5.9688, -1.3906, 3.0000, -2.4531, -5.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4688, -3.7188, -0.2871, 1.8750, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.6250, -4.1875, 1.3281, 1.1953, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9688, -0.4434, 3.4531, -2.0625, -5.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2969, 0.9102, 4.1875, -0.6914, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7969, -2.3594, 0.4531, 1.1406, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.2812, -3.9688, 0.4668, 2.2969, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:42:43,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.27 | optimizer_gradients: 0.21 | optimizer_step: 0.20
[2025-11-06 18:42:43,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.44 | bwd_microstep: 1195.68 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1194.36 | step_microstep: 3.34
[2025-11-06 18:42:43,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.02 | bwd: 1196.65 | bwd_inner: 2.05 | bwd_allreduce: 1194.42 | step: 3.46
67%|██████▋ | 2356/3507 [57:57<26:47, 1.40s/it] {'loss': 0.8225, 'learning_rate': 5.139296483942639e-06, 'epoch': 0.67}
tensor([[-5.7500, -3.5469, 1.1953, 1.2344, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:43,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.55 | bwd_microstep: 1.19 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-4.8125, -4.5000, -0.3105, 3.4844, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9688, -5.1250, -1.6250, 2.5156, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6562, -3.3906, 0.4102, 1.9531, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9844, -4.5000, -1.4297, 3.2188, -1.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.3125, -2.8281, 1.8516, 1.0469, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6250, -4.2812, -2.7656, 1.1797, -1.1797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7500, -1.4844, 2.0156, -0.7500, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:42:44,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.08 | optimizer_gradients: 0.21 | optimizer_step: 0.23
[2025-11-06 18:42:44,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.21 | bwd_microstep: 490.33 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 489.08 | step_microstep: 3.27
[2025-11-06 18:42:44,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.76 | bwd: 491.52 | bwd_inner: 2.21 | bwd_allreduce: 489.13 | step: 3.37
67%|██████▋ | 2357/3507 [57:58<23:27, 1.22s/it] {'loss': 0.4869, 'learning_rate': 5.13122593414505e-06, 'epoch': 0.67}
tensor([[-1.7812, 1.6328, 2.2656, -1.7734, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:44,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.14 | bwd_microstep: 1.87 | bwd_inner_microstep: 1.61 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.21
tensor([[-5.2188, -1.4531, 2.1875, -1.3828, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0000, -2.5625, 2.7188, 0.3320, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.7188, -2.3906, 0.9961, -0.0693, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.7500, -4.0312, -0.1172, 2.2500, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.3750, -5.0938, -0.2158, 1.8750, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.8438, 0.8555, 2.4844, -1.7891, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[2.4531, 4.8125, 5.4062, 3.5625, 1.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:42:46,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.24 | optimizer_gradients: 0.23 | optimizer_step: 0.30
[2025-11-06 18:42:46,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.23 | bwd_microstep: 1941.37 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 1940.40 | step_microstep: 4.49
[2025-11-06 18:42:46,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.37 | bwd: 1943.24 | bwd_inner: 2.50 | bwd_allreduce: 1940.52 | step: 4.70
67%|██████▋ | 2358/3507 [58:00<29:41, 1.55s/it] {'loss': 0.2415, 'learning_rate': 5.1231595387260655e-06, 'epoch': 0.67}
tensor([[-1.9766, 1.7344, 3.1406, -1.8047, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:46,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 122.12 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.6289, -1.4609, -0.0134, 4.1250, 1.3047]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.0312, -4.6250, 0.2051, 1.9609, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4688, -3.1406, 0.4688, -0.2109, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.3281, -0.5703, 2.7969, 1.1016, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3750, -4.6562, -1.1250, 3.2969, -1.5234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.7500, -1.9375, 3.0781, -0.5156, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0938, -5.6875, -1.2109, 2.5625, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:42:48,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:42:48,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 231.98 | bwd_microstep: 1281.72 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1280.64 | step_microstep: 2.26
[2025-11-06 18:42:48,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.12 | bwd: 1282.73 | bwd_inner: 1.91 | bwd_allreduce: 1280.68 | step: 2.35
67%|██████▋ | 2359/3507 [58:02<30:23, 1.59s/it] {'loss': 0.8108, 'learning_rate': 5.11509730456849e-06, 'epoch': 0.67}
tensor([[-4.4375, -3.1250, 0.6016, 1.9453, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:48,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.65 | bwd_microstep: 1.12 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.5938, -4.3125, -0.6602, 0.8633, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4688, -3.8281, -0.0747, 2.5938, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.7188, -5.3125, -0.3457, 3.6719, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6562, -0.8672, 3.5625, 0.2051, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.3750, -4.5000, -1.6562, 2.2188, -1.7266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.2109, 2.7188, 2.9531, -2.4688, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-5.1562, -4.1875, 0.4668, 3.3125, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:42:48,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:42:48,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.01 | bwd_microstep: 65.99 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 64.93 | step_microstep: 1.58
[2025-11-06 18:42:48,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.68 | bwd: 67.11 | bwd_inner: 2.04 | bwd_allreduce: 64.96 | step: 1.66
67%|██████▋ | 2360/3507 [58:02<23:50, 1.25s/it] {'loss': 0.1921, 'learning_rate': 5.107039238551588e-06, 'epoch': 0.67}
tensor([[-4.2500, -4.2188, -0.9414, 2.5938, -1.7891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9062, -1.8438, 2.6562, 0.3418, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2188, -2.4531, 1.4531, 1.9062, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.9688, -3.7344, 1.5781, 1.7500, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:49,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.83 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-7.1562, -5.8438, -0.9609, 0.7812, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0000, -4.0312, 1.5234, 2.3750, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5938, -0.5547, 2.1250, -2.7500, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-1.2344, 1.8281, 2.1094, -1.1562, -1.7422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:42:51,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:42:51,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.89 | bwd_microstep: 2200.38 | bwd_inner_microstep: 2.34 | bwd_allreduce_microstep: 2197.88 | step_microstep: 1.52
[2025-11-06 18:42:51,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.79 | bwd: 2201.14 | bwd_inner: 3.04 | bwd_allreduce: 2197.89 | step: 1.60
67%|██████▋ | 2361/3507 [58:05<31:35, 1.65s/it] {'loss': 0.7365, 'learning_rate': 5.098985347551061e-06, 'epoch': 0.67}
tensor([[-5.0938, -0.5742, 3.1875, -2.5469, -5.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([[-3.3125, -0.1934, 3.3750, 0.7969, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([2], device='cuda:0')
[2025-11-06 18:42:51,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.57 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-2.6406, 1.5859, 2.9844, -2.7031, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.0781, 0.6953, 2.7031, -1.6328, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8125, -3.8594, -0.1973, 1.6719, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3125, -3.9219, 0.0518, 1.2031, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.1875, -4.7500, 0.3438, 2.2656, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.6562, -5.6562, 0.5742, 1.7031, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:42:51,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:42:51,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.97 | bwd_microstep: 76.95 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 75.68 | step_microstep: 1.37
[2025-11-06 18:42:51,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.57 | bwd: 77.79 | bwd_inner: 1.93 | bwd_allreduce: 75.73 | step: 1.45
67%|██████▋ | 2362/3507 [58:05<24:45, 1.30s/it] {'loss': 0.1707, 'learning_rate': 5.090935638439061e-06, 'epoch': 0.67}
tensor([[-3.6094, -3.2969, -0.6953, 2.2500, -1.4297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4375, -4.0625, 0.1328, 1.8125, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.4062, -1.2188, 3.0781, -1.2578, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1562, -1.5234, 3.0000, -0.2910, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:52,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.71 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-7.3750, -5.6562, 0.3906, 2.1406, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.5312, 2.3594, 3.1406, -2.3906, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.8125, -4.3438, 0.7305, 2.3281, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.0625, -1.2812, 2.8906, -0.5117, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:42:54,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.41 | optimizer_step: 0.35
[2025-11-06 18:42:54,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.74 | bwd_microstep: 1739.28 | bwd_inner_microstep: 2.26 | bwd_allreduce_microstep: 1736.81 | step_microstep: 3.49
[2025-11-06 18:42:54,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.48 | bwd: 1740.24 | bwd_inner: 3.13 | bwd_allreduce: 1736.88 | step: 3.57
67%|██████▋ | 2363/3507 [58:07<29:46, 1.56s/it] {'loss': 0.14, 'learning_rate': 5.082890118084159e-06, 'epoch': 0.67}
tensor([[-4.1250, -0.4688, 3.0469, -0.8516, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:54,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.33 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.6875, -5.0625, -1.0156, 1.7578, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.3750, -3.7031, 1.0547, 0.0413, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0625, -0.2412, 3.1875, -3.1875, -5.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3750, 0.2217, 3.9531, -2.1250, -4.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-10.0000, -6.8438, -0.4004, -1.4766, -7.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.2812, -3.3750, 1.2031, -0.3301, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.1562, 1.2344, 3.4219, -0.2617, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
[2025-11-06 18:42:55,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 18:42:55,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.95 | bwd_microstep: 857.95 | bwd_inner_microstep: 2.20 | bwd_allreduce_microstep: 855.56 | step_microstep: 1.85
[2025-11-06 18:42:55,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.33 | bwd: 858.97 | bwd_inner: 3.18 | bwd_allreduce: 855.59 | step: 1.93
67%|██████▋ | 2364/3507 [58:09<27:43, 1.46s/it] {'loss': 0.4158, 'learning_rate': 5.0748487933513564e-06, 'epoch': 0.67}
tensor([[-3.8750, -3.0156, 0.3828, 2.0938, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.1406, -0.8164, 2.1250, 1.1484, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6875, -3.2500, 0.7188, 2.0000, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-8.3125, -6.7812, -0.6523, 1.5156, -5.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:55,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.54 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-5.0625, -2.7969, 1.4844, 1.3359, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.1875, -4.2500, 1.8281, 0.8359, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0938, -4.3125, -0.1875, 2.3438, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1250, -2.3594, 1.0703, 1.1094, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:55,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.56 | optimizer_step: 0.56
[2025-11-06 18:42:55,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.13 | bwd_microstep: 5.20 | bwd_inner_microstep: 2.12 | bwd_allreduce_microstep: 2.77 | step_microstep: 5.17
[2025-11-06 18:42:55,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.74 | bwd: 6.14 | bwd_inner: 2.96 | bwd_allreduce: 2.84 | step: 5.30
67%|██████▋ | 2365/3507 [58:09<21:56, 1.15s/it] {'loss': 0.3557, 'learning_rate': 5.0668116711020675e-06, 'epoch': 0.67}
tensor([[-5.0312, -5.3438, -2.2344, 2.2188, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4062, -1.7188, 2.9062, -0.8711, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-8.5625, -6.7500, -0.3379, 1.5312, -5.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:42:55,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.62 | bwd_microstep: 2.21 | bwd_inner_microstep: 1.85 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.22
tensor([[-4.8438, -1.8750, 1.9141, -0.6172, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.6562, -2.0312, 2.3750, -0.8281, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1250, -1.7031, 3.1406, 0.5273, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.6875, -0.6914, 2.9844, -1.4766, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6875, -0.7266, 3.1562, -0.9297, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:42:56,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.18 | optimizer_step: 0.21
[2025-11-06 18:42:56,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.90 | bwd_microstep: 170.57 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 169.45 | step_microstep: 2.15
[2025-11-06 18:42:56,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.53 | bwd: 172.77 | bwd_inner: 2.91 | bwd_allreduce: 169.57 | step: 2.37
67%|██████▋ | 2366/3507 [58:10<18:28, 1.03it/s] {'loss': 0.6417, 'learning_rate': 5.058778758194134e-06, 'epoch': 0.67}
tensor([[-4.7812, -1.0547, 2.8125, -0.9062, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.2188, -1.6250, 2.9375, -0.2266, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:56,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.75 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.0312, -4.3438, -1.0859, 3.4219, -1.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1250, -4.3438, -1.0938, 3.1562, -1.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-9.0000, -8.8125, -3.8750, 0.6836, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.5312, -3.7188, 1.4922, 0.6758, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2500, -3.7812, 0.3457, 1.8828, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3750, -1.1875, 2.6094, 0.1060, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:42:59,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:42:59,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.41 | bwd_microstep: 2496.80 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 2495.78 | step_microstep: 1.65
[2025-11-06 18:42:59,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 415.22 | bwd: 2497.52 | bwd_inner: 1.56 | bwd_allreduce: 2495.82 | step: 1.73
67%|██████▋ | 2367/3507 [58:13<29:44, 1.57s/it] {'loss': 0.5867, 'learning_rate': 5.050750061481799e-06, 'epoch': 0.67}
tensor([[-6.5312, -5.6562, -0.6523, 2.4688, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.5781, 0.5078, 2.4688, -0.2852, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8906, -0.7109, 2.2344, -0.2432, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.3281, 0.3516, 3.3750, -0.6562, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:59,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 285.08 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.4688, -1.1250, 1.6016, 0.5977, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.3594, -3.9844, -1.9531, 2.2656, -0.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.6875, -3.2188, 2.7500, 0.3496, -5.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.6875, -4.3438, 1.0625, 3.3438, -3.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:42:59,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.21 | optimizer_gradients: 0.23 | optimizer_step: 0.19
[2025-11-06 18:42:59,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.93 | bwd_microstep: 33.40 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 32.23 | step_microstep: 3.67
[2025-11-06 18:42:59,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 428.03 | bwd: 34.25 | bwd_inner: 1.81 | bwd_allreduce: 32.27 | step: 3.75
68%|██████▊ | 2368/3507 [58:13<23:40, 1.25s/it] {'loss': 0.1333, 'learning_rate': 5.042725587815707e-06, 'epoch': 0.68}
tensor([[-4.9688, -5.6562, -3.0938, 1.6250, -1.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3125, -4.2500, -1.0469, 2.7188, -1.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6562, -2.9062, 1.2969, 2.0312, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:42:59,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.52 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-7.8438, -6.8438, -1.1094, 2.1250, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5938, -0.6836, 3.3125, -1.2656, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.9062, -1.1484, 3.5312, 0.1846, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8594, -0.3496, 2.7812, -0.1709, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.2812, -4.2188, 0.0055, 1.9375, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:43:02,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:43:02,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.80 | bwd_microstep: 2164.37 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 2163.30 | step_microstep: 3.20
[2025-11-06 18:43:02,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 388.31 | bwd: 2165.39 | bwd_inner: 1.89 | bwd_allreduce: 2163.35 | step: 3.30
68%|██████▊ | 2369/3507 [58:16<31:41, 1.67s/it] {'loss': 0.1946, 'learning_rate': 5.034705344042898e-06, 'epoch': 0.68}
tensor([[-3.4219, -3.0938, 0.2773, 3.3281, -1.3047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.7812, -5.5000, -0.7930, 1.1797, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:02,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.00 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.4062, -1.0859, 2.3438, -0.9570, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.3438, -3.6094, 1.6953, 0.8828, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.1562, -3.0781, -1.0078, 4.0000, 0.3848]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.6562, -6.2812, -1.5547, 2.4062, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.5469, -4.6250, -2.2188, 3.2812, -0.5820]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.2812, -4.7500, 1.2188, 1.3125, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:43:02,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.18 | optimizer_step: 0.26
[2025-11-06 18:43:02,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.79 | bwd_microstep: 136.64 | bwd_inner_microstep: 1.29 | bwd_allreduce_microstep: 135.27 | step_microstep: 3.04
[2025-11-06 18:43:02,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.82 | bwd: 137.59 | bwd_inner: 2.14 | bwd_allreduce: 135.31 | step: 3.12
68%|██████▊ | 2370/3507 [58:16<25:18, 1.34s/it] {'loss': 0.1708, 'learning_rate': 5.0266893370068096e-06, 'epoch': 0.68}
tensor([[-8.4375, -7.1250, -0.9141, 1.9141, -5.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6875, -3.2656, 0.5078, 1.4375, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:03,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.30 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.4688, -4.0938, 0.0742, 1.6797, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5312, -3.0000, 0.7695, 1.8359, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.8750, -2.5625, 1.7891, 1.4766, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2188, -2.6250, 1.2969, 0.5469, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.8750, 1.0156, 2.8906, -1.9062, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.9062, -1.4766, 1.5781, -2.0625, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:43:06,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.22 | optimizer_step: 0.24
[2025-11-06 18:43:06,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 108.39 | bwd_microstep: 2972.84 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 2971.40 | step_microstep: 2.58
[2025-11-06 18:43:06,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 278.67 | bwd: 2973.86 | bwd_inner: 2.24 | bwd_allreduce: 2971.45 | step: 2.66
68%|██████▊ | 2371/3507 [58:20<36:31, 1.93s/it] {'loss': 0.2627, 'learning_rate': 5.018677573547255e-06, 'epoch': 0.68}
tensor([[-4.2812, -4.4062, -1.6719, 1.9766, -1.7734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.0312, -3.7188, 2.4688, 0.7031, -5.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2188, -3.7188, -0.2832, 2.4688, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:06,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.61 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.8750, -4.9062, -0.2002, 2.4844, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7031, -1.6094, 2.4688, 2.7969, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7500, -3.5781, -0.4863, 2.8750, -1.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.5000, -2.6562, 3.2500, 0.2422, -5.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.7812, -0.1660, 3.8125, -2.0781, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:43:06,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.24 | optimizer_step: 0.20
[2025-11-06 18:43:06,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.71 | bwd_microstep: 161.70 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 160.74 | step_microstep: 2.24
[2025-11-06 18:43:06,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.35 | bwd: 162.68 | bwd_inner: 1.73 | bwd_allreduce: 160.79 | step: 2.32
68%|██████▊ | 2372/3507 [58:20<28:47, 1.52s/it] {'loss': 0.1293, 'learning_rate': 5.010670060500433e-06, 'epoch': 0.68}
tensor([[-5.7500, -2.2031, 2.5938, -0.3184, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:07,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.38 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.7812, -5.3125, -0.7344, 2.9219, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6250, -3.7188, -1.2812, 4.1875, 0.1226]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3125, -5.0312, -2.7656, 2.0000, -1.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.2188, 0.6797, 2.9219, 0.3672, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.3750, -2.0469, 2.8594, -1.5391, -5.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8594, -0.2773, 1.0391, -3.0312, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1250, -3.6875, -0.4316, 0.5039, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:43:10,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:43:10,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.18 | bwd_microstep: 3004.12 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 3003.16 | step_microstep: 1.84
[2025-11-06 18:43:10,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.58 | bwd: 3004.90 | bwd_inner: 1.54 | bwd_allreduce: 3003.21 | step: 1.93
68%|██████▊ | 2373/3507 [58:24<39:05, 2.07s/it] {'loss': 0.4852, 'learning_rate': 5.002666804698911e-06, 'epoch': 0.68}
tensor([[-6.3125, -3.6875, 1.7266, 1.3125, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:43:10,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 78.40 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.06
tensor([[-4.4375, -0.5664, 2.3438, -1.5312, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.8750, -3.0469, -0.3945, 3.4844, -0.6367]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5625, -4.5000, -0.2832, 1.8281, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5938, -3.3438, 0.4902, 2.3125, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([[-6.6562, -3.4062, 1.8750, 0.1006, -5.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([2], device='cuda:0') tensor([[-2.6875, -3.2031, -1.6172, 2.2969, -0.4492]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-2.7031, -2.9219, -1.1797, 2.2812, -0.6367]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:43:10,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.20 [2025-11-06 18:43:10,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.62 | bwd_microstep: 96.08 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 94.89 | step_microstep: 1.97 [2025-11-06 18:43:10,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.05 | bwd: 97.04 | bwd_inner: 1.95 | bwd_allreduce: 94.93 | step: 2.03 68%|██████▊ | 2374/3507 [58:24<29:44, 1.58s/it] {'loss': 0.5008, 'learning_rate': 4.994667812971633e-06, 'epoch': 0.68} 68%|██████▊ | 2374/3507 [58:24<29:44, 1.58s/it]tensor([[-5.1562, -1.3438, 2.7969, -0.6680, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5000, -0.5469, 2.9062, 0.8906, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1250, -0.7109, 2.5312, -0.8047, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:43:10,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.78 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.8125, -0.4551, 3.4688, 2.4219, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.1875, -6.2812, -0.1211, 1.4141, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.0000, -4.0938, -0.6641, -4.6562, -7.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') 
tensor([[-7.5625, -5.0938, 1.1562, 1.3750, -5.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9375, -2.4531, 1.7344, -1.3906, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:43:13,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.73 | optimizer_gradients: 0.19 | optimizer_step: 0.27 [2025-11-06 18:43:13,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.84 | bwd_microstep: 2395.40 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 2394.23 | step_microstep: 2.71 [2025-11-06 18:43:13,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.64 | bwd: 2396.37 | bwd_inner: 1.97 | bwd_allreduce: 2394.27 | step: 2.78 68%|██████▊ | 2375/3507 [58:27<36:39, 1.94s/it] {'loss': 0.2142, 'learning_rate': 4.9866730921439e-06, 'epoch': 0.68} 68%|██████▊ | 2375/3507 [58:27<36:39, 1.94s/it]tensor([[-5.1250, -0.9961, 3.2969, -1.0391, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2812, -5.0625, -3.0625, 1.4219, -1.5078]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -2.8438, 1.4922, 1.2188, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:43:13,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.68 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.5000, -4.2188, -0.8789, 2.6094, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1250, -2.3594, 1.4062, 3.9844, -1.1641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.3125, -4.4375, 1.3281, 0.3438, -5.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2812, -3.1094, 0.8984, 
0.4512, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.8438, -5.1875, -1.5469, -2.7656, -6.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:43:13,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.04 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:43:13,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.31 | bwd_microstep: 1.86 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 2.34 [2025-11-06 18:43:13,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.01 | bwd: 2.77 | bwd_inner: 1.84 | bwd_allreduce: 0.79 | step: 2.45 68%|██████▊ | 2376/3507 [58:27<28:00, 1.49s/it] {'loss': 0.6047, 'learning_rate': 4.978682649037356e-06, 'epoch': 0.68} 68%|██████▊ | 2376/3507 [58:27<28:00, 1.49s/it]tensor([[-5.0625, -3.5781, 0.7305, 2.0625, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.6875, -5.8750, -1.2344, 1.6875, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4688, -1.8438, 2.9062, -0.1592, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:43:14,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.90 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.9688, 0.0405, 2.3281, -0.5312, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.2188, -5.1562, 0.0981, 1.1016, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.5938, -2.5000, 2.6250, -1.4062, -5.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5938, -3.0312, 0.4961, 1.1641, -2.9844]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.9219, 0.6328, 2.0781, -2.0781, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:43:14,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.68 | optimizer_gradients: 0.19 | optimizer_step: 0.23 [2025-11-06 18:43:14,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 92.93 | bwd_microstep: 544.42 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 543.25 | step_microstep: 2.67 [2025-11-06 18:43:14,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.86 | bwd: 545.26 | bwd_inner: 1.84 | bwd_allreduce: 543.29 | step: 2.74 68%|██████▊ | 2377/3507 [58:28<24:31, 1.30s/it] {'loss': 0.4391, 'learning_rate': 4.9706964904700096e-06, 'epoch': 0.68} 68%|██████▊ | 2377/3507 [58:28<24:31, 1.30s/it]tensor([[-6.4062, -5.0312, -0.2051, 1.7109, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.3125, -4.4375, -0.0491, 0.6602, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6562, -5.0000, -1.3750, 3.3906, -1.6953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:43:14,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.11 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.7500, -1.1016, 3.1250, -0.2432, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0000, -5.2812, -0.9883, 1.8203, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.2188, -1.9844, 2.1094, -0.7305, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0000, -4.6875, -0.5352, 3.0938, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:3') tensor([[-5.1875, -0.7812, 3.1250, -1.9062, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:43:15,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:43:15,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.42 | bwd_microstep: 139.03 | bwd_inner_microstep: 1.44 | bwd_allreduce_microstep: 137.51 | step_microstep: 1.92 [2025-11-06 18:43:15,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.57 | bwd: 139.78 | bwd_inner: 2.08 | bwd_allreduce: 137.56 | step: 2.00 68%|██████▊ | 2378/3507 [58:29<20:06, 1.07s/it] {'loss': 0.1899, 'learning_rate': 4.962714623256217e-06, 'epoch': 0.68} 68%|██████▊ | 2378/3507 [58:29<20:06, 1.07s/it]tensor([[-5.1250, -2.6562, 1.9297, 1.2266, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.1562, -2.8281, -1.6953, 1.9609, -0.0767]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.1875, -0.8633, 2.7031, 0.0425, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:43:15,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.96 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.4844, -4.0312, -1.3672, 3.0156, -0.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.8438, -4.9375, 1.1562, 0.4434, -5.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2812, -3.1875, 1.4531, 3.9688, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5000, -3.4531, -0.1094, 1.4141, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') 
tensor([[-4.1562, -1.4375, 0.5586, -1.5156, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 18:43:18,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 18:43:18,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.67 | bwd_microstep: 2408.92 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2407.85 | step_microstep: 2.30 [2025-11-06 18:43:18,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.66 | bwd: 2409.80 | bwd_inner: 1.78 | bwd_allreduce: 2407.89 | step: 2.37 68%|██████▊ | 2379/3507 [58:31<30:08, 1.60s/it] {'loss': 0.9023, 'learning_rate': 4.954737054206658e-06, 'epoch': 0.68} 68%|██████▊ | 2379/3507 [58:31<30:08, 1.60s/it]tensor([[-4.0625, 0.0287, 2.7188, -2.2812, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:43:18,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.55 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.2188, -1.7891, 2.8906, -1.7578, -5.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.3906, -0.6367, 2.6406, 2.8438, -1.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.8047, 3.1250, 3.2031, -2.4688, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.7500, -2.0000, 2.7344, -0.4512, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2812, -0.1670, 2.1719, -1.1328, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3906, 0.9180, 3.6094, -1.6250, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-1.9844, 1.1172, 
1.3438, -1.7891, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:2') [2025-11-06 18:43:18,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:43:18,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.54 | bwd_microstep: 212.25 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 211.18 | step_microstep: 2.25 [2025-11-06 18:43:18,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 246.09 | bwd: 213.25 | bwd_inner: 1.90 | bwd_allreduce: 211.22 | step: 2.33 68%|██████▊ | 2380/3507 [58:32<23:49, 1.27s/it] {'loss': 1.4613, 'learning_rate': 4.946763790128362e-06, 'epoch': 0.68} 68%|██████▊ | 2380/3507 [58:32<23:49, 1.27s/it]tensor([[-5.0625, -3.4531, 0.9922, 2.1406, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5938, -3.5781, 0.4043, 2.4219, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:43:18,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.38 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-6.4688, -5.6562, -0.6797, 2.2812, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7812, -2.9531, 1.0547, 1.6562, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7500, -0.8125, 3.3125, -3.0938, -6.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-0.7109, 3.1094, 3.0781, -2.1094, -1.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7969, 0.0698, 2.5312, -1.8906, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.3965, 2.3281, 2.5156, -0.6641, -1.0547]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:43:20,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 18:43:20,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.71 | bwd_microstep: 1560.76 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1559.65 | step_microstep: 1.90 [2025-11-06 18:43:20,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 413.12 | bwd: 1561.62 | bwd_inner: 1.78 | bwd_allreduce: 1559.70 | step: 1.98 68%|██████▊ | 2381/3507 [58:34<28:01, 1.49s/it] {'loss': 0.8208, 'learning_rate': 4.93879483782466e-06, 'epoch': 0.68} 68%|██████▊ | 2381/3507 [58:34<28:01, 1.49s/it]tensor([[-5.8438, -5.1250, -0.7070, 2.1719, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.6250, -5.7812, -0.4238, 2.6250, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2188, -5.0938, -0.5117, 1.6797, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:43:20,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.60 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.5156, -1.9141, 0.3789, 2.2969, -0.9922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5000, -3.0625, 0.6289, 1.7344, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.8125, -4.2812, 1.8359, 1.6406, -4.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.0938, -4.3125, -0.1162, 0.8398, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5625, -0.8438, 3.8906, -1.9531, -5.7188]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:43:21,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.19 | optimizer_step: 0.27 [2025-11-06 18:43:21,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.58 | bwd_microstep: 942.67 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 941.69 | step_microstep: 2.06 [2025-11-06 18:43:21,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.21 | bwd: 943.68 | bwd_inner: 1.80 | bwd_allreduce: 941.74 | step: 2.14 68%|██████▊ | 2382/3507 [58:35<27:21, 1.46s/it] {'loss': 0.3266, 'learning_rate': 4.930830204095233e-06, 'epoch': 0.68} 68%|██████▊ | 2382/3507 [58:35<27:21, 1.46s/it]tensor([[-6.7812, -4.6562, 1.1641, 2.0469, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:43:22,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.94 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.7500, -4.3750, 0.6992, 0.3809, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1562, -5.7812, -2.4219, 0.6719, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6562, -3.0469, 2.1562, 1.5781, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6250, -3.0781, 0.1895, 0.6680, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.1719, -3.2344, -2.5938, 1.4062, 0.0096]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2969, 0.6758, 3.9219, -0.6719, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.6250, -1.7734, 2.9375, -0.4180, -4.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:1') [2025-11-06 18:43:22,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:43:22,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.84 | bwd_microstep: 541.74 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 540.99 | step_microstep: 2.22 [2025-11-06 18:43:22,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.81 | bwd: 542.45 | bwd_inner: 1.30 | bwd_allreduce: 541.03 | step: 2.30 68%|██████▊ | 2383/3507 [58:36<24:19, 1.30s/it] {'loss': 0.3452, 'learning_rate': 4.922869895736058e-06, 'epoch': 0.68} 68%|██████▊ | 2383/3507 [58:36<24:19, 1.30s/it]tensor([[-1.2578, 0.8594, 2.5469, 1.3203, -1.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:43:23,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.07 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.6562, -2.0938, 2.0000, 1.0781, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -4.2812, 0.2432, 2.7344, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.6250, -2.8906, 2.3750, -0.8086, -5.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5312, -3.2031, 1.0703, 2.7031, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6562, -2.2188, 1.6094, 0.7422, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9844, -2.6406, 0.0654, 2.8281, -1.0703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4062, -3.3125, -0.2207, 2.9688, -1.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
[2025-11-06 18:43:25,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 18:43:25,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.15 | bwd_microstep: 1917.58 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 1916.69 | step_microstep: 2.06 [2025-11-06 18:43:25,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.26 | bwd: 1918.30 | bwd_inner: 1.42 | bwd_allreduce: 1916.74 | step: 2.15 68%|██████▊ | 2384/3507 [58:39<29:51, 1.60s/it] {'loss': 0.8321, 'learning_rate': 4.914913919539429e-06, 'epoch': 0.68} 68%|██████▊ | 2384/3507 [58:39<29:51, 1.60s/it]tensor([[-5.8438, -2.5625, 1.3047, -1.3516, -5.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.6992, -1.6562, -0.5664, 3.8438, 1.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:43:25,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.02 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5781, -3.9844, -1.3906, 2.6094, -1.1016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0000, -5.4375, -1.8203, 3.0312, -1.9453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2188, -2.5469, 1.3906, 2.2031, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.7188, -3.4531, 2.5000, 0.7383, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.4688, -5.4688, 0.6836, 2.0000, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7188, -2.9219, 1.3281, 2.1094, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:43:25,689] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:43:25,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.80 | bwd_microstep: 168.72 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 167.62 | step_microstep: 1.49 [2025-11-06 18:43:25,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.85 | bwd: 169.64 | bwd_inner: 1.86 | bwd_allreduce: 167.66 | step: 1.57 68%|██████▊ | 2385/3507 [58:39<23:40, 1.27s/it] {'loss': 0.7127, 'learning_rate': 4.906962282293941e-06, 'epoch': 0.68} 68%|██████▊ | 2385/3507 [58:39<23:40, 1.27s/it]tensor([[-5.2188, -1.8125, 2.9062, 0.4082, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9375, -5.4688, -2.2500, 2.6719, -1.8672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7500, -3.9531, 0.6250, 1.6172, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:43:25,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.68 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.0938, -2.5000, 2.8281, 0.0559, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.2500, -6.1250, -0.1787, 2.7344, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6562, -0.6484, 3.5000, -0.5469, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1875, 0.2080, 3.8750, -1.5781, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -5.1562, -1.3906, 3.2969, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:43:27,970] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:43:27,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.66 | bwd_microstep: 1873.43 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1872.34 | step_microstep: 4.84 [2025-11-06 18:43:27,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 370.37 | bwd: 1874.34 | bwd_inner: 1.83 | bwd_allreduce: 1872.38 | step: 4.92 68%|██████▊ | 2386/3507 [58:41<29:21, 1.57s/it] {'loss': 1.1055, 'learning_rate': 4.899014990784485e-06, 'epoch': 0.68} 68%|██████▊ | 2386/3507 [58:41<29:21, 1.57s/it]tensor([[-4.9062, -4.9062, -1.0312, 2.9531, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7188, -5.2188, -3.3281, 0.6523, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:43:28,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.83 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.4062, -2.6562, 1.5391, 2.3438, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.3438, 1.3594, 2.9844, -1.6797, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.6562, -0.4219, 3.1875, -1.2500, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.7695, 2.0156, 1.7734, -1.0000, -1.2266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-7.7188, -4.5000, 1.5000, 0.0713, -5.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.8750, -4.4375, 0.2490, 1.8047, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:43:28,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:43:28,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.42 | bwd_microstep: 22.73 | bwd_inner_microstep: 6.09 | bwd_allreduce_microstep: 16.54 | step_microstep: 4.89
[2025-11-06 18:43:28,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.28 | bwd: 23.62 | bwd_inner: 6.88 | bwd_allreduce: 16.58 | step: 4.97
68%|██████▊ | 2387/3507 [58:42<23:06, 1.24s/it] {'loss': 0.9572, 'learning_rate': 4.891072051792249e-06, 'epoch': 0.68}
tensor([[-6.5938, -2.1719, 2.9844, -1.7578, -6.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0938, -3.5938, 0.3945, 3.5938, -1.7109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5000, -3.9375, -1.8125, 2.0000, -1.1484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.5156, -3.0469, -2.2188, 1.0000, -0.5391]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:43:28,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.15 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.3125, -5.0625, -1.2969, 2.0312, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.4062, -5.2812, 0.6523, 1.4844, -5.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.1875, 1.3672, 2.9375, -1.2188, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.7500, -6.0000, -1.7734, -0.8828, -5.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:43:31,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.22 | optimizer_step: 0.34
[2025-11-06 18:43:31,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.77 | bwd_microstep: 1989.39 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 1988.47 | step_microstep: 2.43
[2025-11-06 18:43:31,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 425.94 | bwd: 1990.26 | bwd_inner: 1.56 | bwd_allreduce: 1988.53 | step: 2.52
68%|██████▊ | 2388/3507 [58:44<31:01, 1.66s/it] {'loss': 0.3635, 'learning_rate': 4.8831334720947035e-06, 'epoch': 0.68}
tensor([[-4.0312, -0.5898, 2.7031, -0.0603, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8438, -4.2812, -1.6875, 2.3750, -1.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:31,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.28 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.4062, -0.5547, 1.4375, -3.1406, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.1875, -3.0625, 1.3750, -0.3047, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5312, -4.5000, -0.2471, -0.0437, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0625, -4.0625, -0.2930, 3.7031, -1.5078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.8281, 1.1562, 3.2969, -1.7188, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.2812, -5.4688, -1.8594, 2.3750, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:43:32,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:43:32,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.73 | bwd_microstep: 566.03 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 564.83 | step_microstep: 1.83
[2025-11-06 18:43:32,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.03 | bwd: 566.91 | bwd_inner: 1.92 | bwd_allreduce: 564.86 | step: 1.90
68%|██████▊ | 2389/3507 [58:45<27:17, 1.46s/it] {'loss': 0.1555, 'learning_rate': 4.875199258465594e-06, 'epoch': 0.68}
tensor([[-6.2188, -3.3281, 2.2031, 1.0781, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.0625, 0.6836, 3.3906, -0.5586, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.4688, -3.6875, 0.3359, 3.1406, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:32,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.40 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-4.0625, -2.2500, 0.5156, -0.0496, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.0938, -4.9688, -0.1777, 1.9766, -3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.8438, -5.4062, -0.7656, 3.0000, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2500, -3.9219, -0.6719, 2.5469, -1.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0312, 0.2637, 3.7188, -1.6641, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:43:34,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.20 | optimizer_step: 0.24
[2025-11-06 18:43:34,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 296.59 | bwd_microstep: 1447.13 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1446.05 | step_microstep: 2.07
[2025-11-06 18:43:34,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 498.03 | bwd: 1448.04 | bwd_inner: 1.79 | bwd_allreduce: 1446.10 | step: 2.16
68%|██████▊ | 2390/3507 [58:47<30:33, 1.64s/it] {'loss': 0.2178, 'learning_rate': 4.867269417674956e-06, 'epoch': 0.68}
tensor([[-5.6250, -4.7812, -0.7695, 1.4688, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3125, -4.3438, -0.7383, 3.5781, -1.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:34,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.51 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.0469, -3.5312, -0.6758, 3.7500, -0.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2500, 0.0791, 3.6875, -1.3125, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7344, -0.3848, 2.8594, -0.3906, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.4375, -4.4688, 1.6016, 0.5430, -5.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.9219, -1.5156, 0.2852, 4.3125, 1.0547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5312, -5.0312, -2.3906, 1.8438, -1.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:43:34,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:43:34,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.51 | bwd_microstep: 2.91 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 1.97 | step_microstep: 3.15
[2025-11-06 18:43:34,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.97 | bwd: 3.69 | bwd_inner: 1.51 | bwd_allreduce: 2.01 | step: 3.24
68%|██████▊ | 2391/3507 [58:48<23:43, 1.28s/it] {'loss': 0.0847, 'learning_rate': 4.8593439564890844e-06, 'epoch': 0.68}
tensor([[-3.9688, -1.3828, 2.5312, 1.7891, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.9844, -3.0469, -0.1147, 1.7969, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3125, -4.6875, -0.4277, 2.5938, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:34,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.39 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.5938, -2.0781, 1.8750, 1.0234, -3.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4375, -3.1562, 1.0703, 0.6211, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.9688, -5.4062, -1.3750, 1.6406, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9531, 0.5117, 3.7344, -2.0156, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.6562, -5.7500, 0.7305, 2.3438, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:43:37,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:43:37,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.46 | bwd_microstep: 2463.55 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 2462.54 | step_microstep: 1.87
[2025-11-06 18:43:37,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.86 | bwd: 2464.59 | bwd_inner: 1.88 | bwd_allreduce: 2462.58 | step: 1.95
68%|██████▊ | 2392/3507 [58:51<32:31, 1.75s/it] {'loss': 0.7499, 'learning_rate': 4.851422881670529e-06, 'epoch': 0.68}
tensor([[-4.1250, -3.8906, -0.7656, 2.4844, -1.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.9219, -4.4375, -2.3906, 1.4766, -1.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:37,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.70 | bwd_microstep: 1.13 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.7188, -4.8750, -1.4141, 2.6875, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.7188, -2.4375, 1.5078, -0.9805, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0625, -4.1562, 0.1494, 2.5000, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.9375, -2.9375, 2.2031, 0.6797, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1250, -0.4902, 3.3750, -0.1328, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.2500, -3.9844, -0.0496, -0.4512, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:43:37,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.24 | optimizer_step: 0.21
[2025-11-06 18:43:37,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.64 | bwd_microstep: 88.02 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 86.75 | step_microstep: 2.19
[2025-11-06 18:43:37,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 292.36 | bwd: 89.16 | bwd_inner: 2.19 | bwd_allreduce: 86.82 | step: 2.28
68%|██████▊ | 2393/3507 [58:51<25:03, 1.35s/it] {'loss': 0.4546, 'learning_rate': 4.843506199978104e-06, 'epoch': 0.68}
tensor([[-3.5938, -1.7734, 1.2891, 1.1953, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.8438, -4.5938, 0.2285, 2.2188, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:38,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.82 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-6.6250, -5.8438, -1.7812, 0.8008, -3.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4375, -1.3281, 2.4531, 0.0615, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6875, -1.4062, 2.3438, -0.0325, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.0312, -1.4766, 1.9766, 0.8242, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.1875, -6.0938, -0.5781, 2.2500, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-9.0625, -9.0625, -3.9062, 1.1172, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:43:38,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:43:38,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.00 | bwd_microstep: 258.25 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 257.01 | step_microstep: 1.74
[2025-11-06 18:43:38,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.84 | bwd: 259.34 | bwd_inner: 2.12 | bwd_allreduce: 257.06 | step: 1.82
68%|██████▊ | 2394/3507 [58:52<21:09, 1.14s/it] {'loss': 0.1872, 'learning_rate': 4.835593918166885e-06, 'epoch': 0.68}
tensor([[-4.1875, 0.1445, 1.9453, -3.6875, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-6.2500, -4.5625, -0.2393, 0.6836, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4375, -1.6328, 2.0156, 0.5078, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8438, -4.5312, -1.1875, 4.0938, -0.8242]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5625, -0.5234, 3.2188, -1.0469, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-6.0938, -5.0312, -1.5469, 0.3672, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:39,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.96 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.3125, -5.3438, -2.0469, 1.9219, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.0000, 0.4746, 3.4062, -2.2500, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:43:40,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.16 | optimizer_step: 0.20
[2025-11-06 18:43:40,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.68 | bwd_microstep: 304.03 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 302.88 | step_microstep: 1.79
[2025-11-06 18:43:40,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 384.65 | bwd: 305.06 | bwd_inner: 2.02 | bwd_allreduce: 302.92 | step: 1.87
68%|██████▊ | 2395/3507 [58:54<25:54, 1.40s/it] {'loss': 1.187, 'learning_rate': 4.827686042988181e-06, 'epoch': 0.68}
tensor([[-5.5000, -1.7422, 2.9688, -0.3750, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0625, -2.1250, 1.2266, -1.0078, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.4375, -5.0000, -0.5820, 2.8594, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7500, -3.5156, -1.8438, 2.2344, -0.4277]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:40,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.67 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-5.7500, -5.1875, -0.3926, 2.9688, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0312, -0.9961, 3.6875, -0.5195, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.9062, -0.8672, 3.0781, -1.6172, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.2500, -4.6875, 1.0312, 2.9219, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:43:41,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 2.65 | optimizer_gradients: 0.23 | optimizer_step: 0.27
[2025-11-06 18:43:41,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.56 | bwd_microstep: 917.02 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 916.06 | step_microstep: 6.66
[2025-11-06 18:43:41,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 440.27 | bwd: 917.88 | bwd_inner: 1.62 | bwd_allreduce: 916.11 | step: 6.78
68%|██████▊ | 2396/3507 [58:55<25:58, 1.40s/it] {'loss': 0.9172, 'learning_rate': 4.8197825811895425e-06, 'epoch': 0.68}
tensor([[-5.3438, -4.6250, -0.0396, 2.7969, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9219, -2.1719, 1.1016, 1.4688, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3750, -5.3750, -1.6562, 2.4219, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7188, -3.9844, 0.4844, 1.5625, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:42,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.15 | bwd_microstep: 1.81 | bwd_inner_microstep: 1.51 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.20
tensor([[-5.3438, -2.7969, 2.1562, 1.3672, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.5938, 1.1484, 2.5156, -2.2812, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.7500, -1.6328, 3.0312, 1.0156, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-8.0000, -6.4062, -1.7188, -0.1406, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:43:43,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.25 | optimizer_step: 0.28
[2025-11-06 18:43:43,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.43 | bwd_microstep: 748.64 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 747.65 | step_microstep: 2.67
[2025-11-06 18:43:43,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.60 | bwd: 750.44 | bwd_inner: 2.40 | bwd_allreduce: 747.76 | step: 2.86
68%|██████▊ | 2397/3507 [58:56<24:36, 1.33s/it] {'loss': 0.5064, 'learning_rate': 4.8118835395147565e-06, 'epoch': 0.68}
tensor([[-1.4141, 1.9062, 2.5000, -1.9609, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.4688, -3.3281, 1.5781, -0.5781, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0000, -1.1641, 2.6094, -1.0859, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:43:43,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.99 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.1562, -1.4609, 3.4375, -0.0618, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.7188, -2.8438, 1.5391, 1.9844, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.9219, 0.8242, 3.2656, -0.6719, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0312, 0.2461, 3.7812, -1.5547, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.2754, 1.9453, 3.3438, 1.4141, -0.4805]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:43:44,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.04 | optimizer_gradients: 0.21 | optimizer_step: 0.19
[2025-11-06 18:43:44,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.49 | bwd_microstep: 1054.60 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1053.37 | step_microstep: 3.44
[2025-11-06 18:43:44,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.50 | bwd: 1055.58 | bwd_inner: 2.02 | bwd_allreduce: 1053.42 | step: 3.53
68%|██████▊ | 2398/3507 [58:58<25:19, 1.37s/it] {'loss': 0.265, 'learning_rate': 4.803988924703839e-06, 'epoch': 0.68}
tensor([[-3.5156, -3.1406, 0.4863, 3.6250, -1.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8438, -1.3359, 1.4375, -2.1094, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:43:44,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.60 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.1406, -3.5781, -0.7891, 3.5312, -0.6602]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.5312, 0.9805, 4.4688, 0.9375, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.9375, -1.6562, 1.6094, 0.3926, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.4375, -4.0000, -0.4316, 2.5781, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5312, -2.3281, 1.7266, 3.4375, -1.7734]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9062, -5.0000, -1.5781, 2.3750, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:43:45,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.59 | optimizer_gradients: 0.20 | optimizer_step: 0.19
[2025-11-06 18:43:45,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.49 | bwd_microstep: 890.00 | bwd_inner_microstep: 2.09 | bwd_allreduce_microstep: 887.71 | step_microstep: 4.52
[2025-11-06 18:43:45,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.15 | bwd: 890.93 | bwd_inner: 2.94 | bwd_allreduce: 887.74 | step: 4.62
68%|██████▊ | 2399/3507 [58:59<24:46, 1.34s/it] {'loss': 0.705, 'learning_rate': 4.796098743493025e-06, 'epoch': 0.68}
tensor([[-4.3125, -1.5547, 2.0000, 0.4219, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5312, -2.1875, -0.1602, -0.0161, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.3125, -3.1719, 1.1719, 1.1016, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.9141, -2.6719, -0.4023, 4.3125, 0.4707]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.8750, -2.1250, 2.7656, -0.5977, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.1562, -4.0625, 0.8828, 1.0938, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:43:47,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.02 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.3125, -2.9375, 2.5469, 0.2422, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.3125, -4.6562, 0.1436, 1.2969, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:43:48,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:43:48,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.44 | bwd_microstep: 122.49 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 121.37 | step_microstep: 2.09
[2025-11-06 18:43:48,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 499.46 | bwd: 123.44 | bwd_inner: 1.88 | bwd_allreduce: 121.42 | step: 2.18
68%|██████▊ | 2400/3507 [59:02<31:05, 1.68s/it] {'loss': 0.8738, 'learning_rate': 4.788213002614772e-06, 'epoch': 0.68}
tensor([[-5.5312, -4.0938, 0.4375, 2.0625, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.0312, -2.8125, 0.5352, 4.0625, -0.7539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.5938, -0.1973, 1.6562, -0.0811, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:43:48,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.68 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.9688, -0.8516, 3.3281, -1.1641, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7500, 1.4453, 3.2969, -1.8750, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.9062, -5.3125, 0.2598, 2.1406, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.0625, -3.0469, 0.6211, -1.1250, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2812, -5.2500, -4.0938, 0.2422, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:43:48,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:43:48,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.24 | bwd_microstep: 292.94 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 291.90 | step_microstep: 1.67
[2025-11-06 18:43:48,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.96 | bwd: 293.76 | bwd_inner: 1.64 | bwd_allreduce: 291.95 | step: 1.77
68%|██████▊ | 2401/3507 [59:02<25:38, 1.39s/it] {'loss': 0.3878, 'learning_rate': 4.780331708797744e-06, 'epoch': 0.68}
tensor([[-5.6875, -3.7188, 0.4766, 1.0469, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.5312, -1.9922, -0.2578, 3.5000, 0.4336]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.8438, -2.6094, -2.2656, 0.9336, -0.0059]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-5.2812, -2.0625, 1.5469, -0.9883, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.7188, -5.5938, -0.2559, 2.5781, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4688, 0.3438, 3.6406, -0.7930, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:43:50,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.12 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.0938, -0.9961, 2.3125, -0.3770, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.7812, -5.5625, -0.6094, 1.4453, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:43:51,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.16 | optimizer_step: 0.19
[2025-11-06 18:43:51,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.09 | bwd_microstep: 379.05 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 378.02 | step_microstep: 1.80
[2025-11-06 18:43:51,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.22 | bwd: 379.97 | bwd_inner: 1.74 | bwd_allreduce: 378.07 | step: 1.89
68%|██████▊ | 2402/3507 [59:05<30:50, 1.67s/it] {'loss': 0.2919, 'learning_rate': 4.772454868766814e-06, 'epoch': 0.68}
tensor([[-4.6562, -3.1562, 0.5234, 1.1797, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.5938, 0.0703, 1.7891, -0.3750, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:43:51,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.23 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.0625, -4.9062, -1.0703, 2.5469, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7969, -1.3047, 1.9531, 0.3730, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.0000, 0.7969, 2.9844, -1.5547, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.7812, -5.9375, -2.6094, 1.6562, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-8.2500, -7.5938, -2.0156, 1.9922, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.1719, -0.8555, 2.0938, 1.1562, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:43:52,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.28
[2025-11-06 18:43:52,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.31 | bwd_microstep: 756.25 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 754.94 | step_microstep: 2.18
[2025-11-06 18:43:52,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.57 | bwd: 757.11 | bwd_inner: 1.96 | bwd_allreduce: 755.00 | step: 2.27
69%|██████▊ | 2403/3507 [59:06<27:52, 1.52s/it] {'loss': 0.389, 'learning_rate': 4.764582489243049e-06, 'epoch': 0.69}
tensor([[-5.1250, -1.4609, 2.8906, -1.0703, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3438, -1.6719, 3.1250, -0.0581, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.5000, -2.6250, 2.1875, 0.4629, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0625, -4.8750, -2.5312, 2.2812, -1.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3438, -3.3281, 0.6211, 2.6406, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.3281, -2.8438, 0.4297, 3.2344, -1.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:54,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.76 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-1.0000, 2.7812, 2.7031, -2.2500, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.1406, -0.2236, 2.8125, 0.4922, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:43:54,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.13 | optimizer_step: 0.14
[2025-11-06 18:43:54,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.36 | bwd_microstep: 1.84 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 2.34
[2025-11-06 18:43:54,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.14 | bwd: 2.69 | bwd_inner: 1.77 | bwd_allreduce: 0.79 | step: 2.43
69%|██████▊ | 2404/3507 [59:08<29:38, 1.61s/it] {'loss': 0.1643, 'learning_rate': 4.7567145769437184e-06, 'epoch': 0.69}
tensor([[-5.0625, -3.9062, 0.3789, 2.3906, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:54,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.10 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.2500, -4.0312, -3.3281, 0.3887, -0.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-5.5625, -3.1250, 1.6719, 1.1875, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9375, 0.1582, 3.1875, -1.5312, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.6875, -4.0312, 1.6562, 1.0312, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.6562, -1.4922, 3.2656, -1.2734, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8438, 0.0215, 3.2031, -1.2266, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.6250, -5.3125, -1.7031, 1.5469, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:43:54,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.20 | optimizer_step: 0.17
[2025-11-06 18:43:54,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 79.06 | bwd_microstep: 104.56 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 103.36 | step_microstep: 1.98
[2025-11-06 18:43:54,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 230.18 | bwd: 105.54 | bwd_inner: 2.00 | bwd_allreduce: 103.40 | step: 2.06
69%|██████▊ | 2405/3507 [59:08<22:44, 1.24s/it] {'loss': 0.3532, 'learning_rate': 4.748851138582269e-06, 'epoch': 0.69}
tensor([[-5.3750, -1.9844, 1.3281, -1.6953, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.8125, -5.5000, 0.1553, 0.4102, -5.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6562, -0.5234, 3.2969, -1.5625, -4.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.2500, -1.7344, 3.3750, 0.6523, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.5938, -4.8125, 1.5234, 1.1484, -5.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.8750, -3.7812, 1.2500, -0.7070, -5.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.9688, -7.3438, -2.1250, 1.7344, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:57,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.74 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.0625, -3.9062, 0.2891, 2.4844, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:43:57,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.14 | optimizer_step: 0.18
[2025-11-06 18:43:57,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.88 | bwd_microstep: 1.90 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.88 | step_microstep: 2.28
[2025-11-06 18:43:57,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.64 | bwd: 2.89 | bwd_inner: 1.85 | bwd_allreduce: 0.91 | step: 2.37
69%|██████▊ | 2406/3507 [59:11<32:30, 1.77s/it] {'loss': 0.2736, 'learning_rate': 4.740992180868344e-06, 'epoch': 0.69}
tensor([[-6.1250, -5.1562, -0.7031, 1.6406, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.1250, -2.8281, 1.5703, -0.8281, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:43:57,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.32 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-7.3438, -6.4062, -1.5391, 1.3516, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2812, -3.6562, 1.4375, 2.6719, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3750, -0.8633, 3.1562, -0.2178, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0000, -3.2031, 2.1562, 3.3281, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.9375, 1.9375, 3.1406, -1.9844, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.0938, -4.3125, -1.2812, 2.8125, -1.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:43:58,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:43:58,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.69 | bwd_microstep: 13.94 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 12.87 | step_microstep: 1.96
[2025-11-06 18:43:58,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.04 | bwd: 14.80 | bwd_inner: 1.77 | bwd_allreduce: 12.91 | step: 2.05
69%|██████▊ | 2407/3507 [59:11<25:09, 1.37s/it] {'loss': 0.2896, 'learning_rate': 4.733137710507753e-06, 'epoch': 0.69}
tensor([[-5.5000, -2.0469, -0.1357, -3.5469, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.9062, -3.0156, 1.8672, 0.5898, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.3438, -0.3379, 3.6094, -0.7852, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7969, -0.5625, 2.5312, -0.5664, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.3750, 0.7109, 2.8125, -2.3750, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.1562, -1.0078, 0.4844, -1.3203, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:43:59,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 63.17 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.3750, -4.8438, -1.6094, 3.3750, -1.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.0078, 2.4219, 2.7344, -2.0312, -2.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
[2025-11-06 18:44:01,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.20 | optimizer_step: 0.17
[2025-11-06 18:44:01,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.68 | bwd_microstep: 1827.38 | bwd_inner_microstep: 5.08 | bwd_allreduce_microstep: 1822.20 | step_microstep: 2.94
[2025-11-06 18:44:01,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 169.87 | bwd: 1828.08 | bwd_inner: 5.68 | bwd_allreduce: 1822.24 | step: 3.02
69%|██████▊ | 2408/3507 [59:14<33:52, 1.85s/it] {'loss': 0.4831, 'learning_rate': 4.7252877342024825e-06, 'epoch': 0.69}
tensor([[-6.6875, -5.7188, -0.5156, 2.4688, -3.8438]],
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8594, -3.0781, -0.1621, 3.7188, -0.5586]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-5.1562, -4.0312, 0.8555, 3.2656, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9062, -1.5469, 3.3125, 0.7031, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:44:01,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 320.21 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-4.8125, -3.6250, 0.3613, 1.7969, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -3.7969, 0.1650, 0.8555, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -5.1562, -2.7344, 1.5469, -1.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1875, -3.4688, -0.7930, 3.0781, -0.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:44:01,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:44:01,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 121.81 | bwd_microstep: 2.16 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.92 | step_microstep: 2.01 [2025-11-06 18:44:01,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 442.01 | bwd: 3.16 | bwd_inner: 2.02 | bwd_allreduce: 0.97 | step: 2.12 69%|██████▊ | 2409/3507 [59:15<26:29, 1.45s/it] {'loss': 0.6584, 'learning_rate': 4.717442258650672e-06, 'epoch': 0.69} 69%|██████▊ | 2409/3507 [59:15<26:29, 1.45s/it]tensor([[-4.1875, -0.7109, 1.6797, -2.2812, -4.2188]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([1], device='cuda:3') tensor([[-5.8438, -3.5781, 0.4746, 0.1216, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.9375, -6.6250, -2.8906, 0.3105, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1250, -3.7812, 0.4277, 1.8750, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0000, -3.0781, 0.9922, 1.2656, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0938, -3.0781, 1.0078, 1.4531, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:44:01,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.78 | bwd_microstep: 6.09 | bwd_inner_microstep: 5.94 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-2.1406, 2.0312, 2.6719, -3.3125, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-6.9375, -4.9375, 1.0469, 2.2500, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:44:04,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 18:44:04,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.98 | bwd_microstep: 1663.72 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1662.69 | step_microstep: 2.32 [2025-11-06 18:44:04,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 449.80 | bwd: 1669.82 | bwd_inner: 6.89 | bwd_allreduce: 1662.75 | step: 2.44 69%|██████▊ | 2410/3507 [59:18<33:30, 1.83s/it] {'loss': 0.9879, 'learning_rate': 4.709601290546638e-06, 'epoch': 0.69} 69%|██████▊ | 2410/3507 [59:18<33:30, 1.83s/it]tensor([[-4.0625, -1.1328, 1.7734, -0.7227, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:0') [2025-11-06 18:44:04,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.66 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.8125, -4.5312, -0.7969, 2.7344, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7969, -4.0938, -1.5312, 2.1875, -1.3516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0000, 2.4688, 4.8438, 0.5117, -1.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3750, -4.4062, -0.4902, 1.5469, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.9531, -0.5781, 1.5234, -0.4766, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.3438, -5.1562, -0.7773, 1.2969, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.5625, -6.2188, -1.9531, -0.3047, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:44:04,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:44:04,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.67 | bwd_microstep: 78.66 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 77.36 | step_microstep: 1.43 [2025-11-06 18:44:04,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.35 | bwd: 79.75 | bwd_inner: 2.19 | bwd_allreduce: 77.40 | step: 1.52 69%|██████▊ | 2411/3507 [59:18<25:48, 1.41s/it] {'loss': 0.1284, 'learning_rate': 4.701764836580841e-06, 'epoch': 0.69} 69%|██████▊ | 2411/3507 [59:18<25:48, 1.41s/it]tensor([[ 0.0447, 3.3125, 2.6562, -1.7656, -1.1641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3125, 
-4.0625, -1.9844, 2.5469, -0.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5312, -5.0000, 0.1562, 1.8203, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.2656, 1.1406, 3.3594, -0.6406, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:44:04,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.24 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.4062, -1.2969, 2.1875, -0.2695, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.3750, -5.5312, -0.9766, 1.6016, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -1.8125, 2.6562, 0.0109, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8281, 1.4688, 2.7969, -3.1719, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:44:06,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.20 | optimizer_step: 0.28 [2025-11-06 18:44:06,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.82 | bwd_microstep: 1167.88 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1166.67 | step_microstep: 2.17 [2025-11-06 18:44:06,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 285.08 | bwd: 1168.85 | bwd_inner: 2.00 | bwd_allreduce: 1166.72 | step: 2.26 69%|██████▉ | 2412/3507 [59:20<26:11, 1.43s/it] {'loss': 0.2469, 'learning_rate': 4.693932903439893e-06, 'epoch': 0.69} 69%|██████▉ | 2412/3507 [59:20<26:11, 1.43s/it]tensor([[-3.7969, -3.2500, -0.0613, 2.4531, -1.7266]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0625, -0.2461, 2.9375, -0.9570, 
-4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:44:06,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.58 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.1250, -4.6875, -0.4062, 2.9062, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-8.2500, -7.6250, -1.6406, 2.5312, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8594, 1.7500, 2.4219, -2.1719, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0938, -3.9531, -0.3984, 3.0469, -1.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -4.9375, -1.2266, 3.1875, -1.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.2500, 0.1050, 2.4531, -0.8438, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:44:06,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.81 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:44:06,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.03 | bwd_microstep: 7.57 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 6.54 | step_microstep: 2.31 [2025-11-06 18:44:06,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.63 | bwd: 8.66 | bwd_inner: 1.94 | bwd_allreduce: 6.58 | step: 2.40 69%|██████▉ | 2413/3507 [59:20<20:29, 1.12s/it] {'loss': 0.5351, 'learning_rate': 4.686105497806545e-06, 'epoch': 0.69} 69%|██████▉ | 2413/3507 [59:20<20:29, 1.12s/it]tensor([[-2.9219, 1.0312, 3.3438, -1.8828, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:44:06,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 155.14 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.9688, 0.6641, 3.9219, -2.2031, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7188, -3.6562, 0.3770, 0.3555, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6562, -3.6406, 1.7266, 2.2344, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8438, -4.6875, -0.2734, 3.7969, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.5859, 2.2344, 4.3438, -0.2080, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-1.8125, 2.7188, 4.1250, -2.2812, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5312, -0.5977, 2.7812, -1.5703, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:44:08,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:44:08,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.91 | bwd_microstep: 1.50 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.65 | step_microstep: 2.00 [2025-11-06 18:44:08,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 370.06 | bwd: 2.52 | bwd_inner: 1.66 | bwd_allreduce: 0.70 | step: 2.11 69%|██████▉ | 2414/3507 [59:22<25:47, 1.42s/it] {'loss': 0.5464, 'learning_rate': 4.678282626359688e-06, 'epoch': 0.69} 69%|██████▉ | 2414/3507 [59:22<25:47, 1.42s/it]tensor([[-5.8438, -3.0000, 1.3516, -0.3555, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.8750, -4.2812, 1.4922, 1.2031, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') tensor([[-3.8125, -0.2930, 1.5156, -2.2344, -3.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2500, -0.2109, 1.7109, 0.6641, -1.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:44:08,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.51 | bwd_microstep: 1.23 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.21 tensor([[-4.8750, -3.1719, 0.9258, 1.7812, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3750, 0.2080, 2.5156, -0.9570, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9688, -0.3613, 2.8594, -0.8086, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0625, -4.2812, -0.7539, 3.7344, -1.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:44:09,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.22 | optimizer_step: 0.20 [2025-11-06 18:44:09,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.78 | bwd_microstep: 32.22 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 30.94 | step_microstep: 2.41 [2025-11-06 18:44:09,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.33 | bwd: 33.46 | bwd_inner: 2.15 | bwd_allreduce: 31.04 | step: 2.62 69%|██████▉ | 2415/3507 [59:22<20:20, 1.12s/it] {'loss': 0.8074, 'learning_rate': 4.670464295774343e-06, 'epoch': 0.69} 69%|██████▉ | 2415/3507 [59:22<20:20, 1.12s/it]tensor([[-6.1250, -5.8750, -2.2188, 1.2734, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4688, -3.1719, 1.7422, 1.8281, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8125, 
-2.7188, 2.1875, 0.0625, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:44:09,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.90 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 tensor([[-4.9375, -4.2188, -0.5547, 2.0156, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9062, -3.2812, 1.3750, 2.5938, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.3125, -4.5625, 0.6992, 2.0312, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.3750, -4.9062, -0.2676, -0.7969, -5.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7188, -2.6406, 2.2031, 0.3633, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:44:11,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.14 | optimizer_step: 0.20 [2025-11-06 18:44:11,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.83 | bwd_microstep: 1.80 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.63 | step_microstep: 2.02 [2025-11-06 18:44:11,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 457.77 | bwd: 2.80 | bwd_inner: 1.93 | bwd_allreduce: 0.70 | step: 2.15 69%|██████▉ | 2416/3507 [59:24<24:54, 1.37s/it] {'loss': 0.8688, 'learning_rate': 4.662650512721656e-06, 'epoch': 0.69} 69%|██████▉ | 2416/3507 [59:24<24:54, 1.37s/it]tensor([[-3.1406, -3.7188, -2.5156, 1.0547, -0.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -4.2812, -0.1562, 2.2969, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:44:11,294] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 142.78 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.08 tensor([[-1.7969, 0.7383, 2.3750, 0.4395, -1.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.0938, -6.7812, -1.6875, 2.5625, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.8789, 0.2773, 2.0625, 2.4219, -0.2285]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7812, -4.5625, -0.4746, 1.3359, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.3125, -3.6406, 0.7539, 3.8594, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.4453, 0.9844, 4.3750, 5.5000, 0.5508]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:44:11,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:44:11,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.13 | bwd_microstep: 333.23 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 332.11 | step_microstep: 1.97 [2025-11-06 18:44:11,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.94 | bwd: 334.10 | bwd_inner: 1.83 | bwd_allreduce: 332.14 | step: 2.05 69%|██████▉ | 2417/3507 [59:25<21:16, 1.17s/it] {'loss': 0.5274, 'learning_rate': 4.654841283868894e-06, 'epoch': 0.69} 69%|██████▉ | 2417/3507 [59:25<21:16, 1.17s/it]tensor([[-3.5156, -3.2500, -0.6484, 2.1719, -1.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:44:11,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.66 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.3125, 
-1.4219, 1.1250, -3.0469, -5.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.7344, 2.0312, 3.1250, -1.6172, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.4062, -5.0312, -0.5820, 3.0312, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -0.9453, 2.5312, -0.0645, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9375, -5.8125, -1.2500, 3.0625, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6562, -3.7500, 0.4980, 0.7891, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.6562, -2.7344, 2.6719, 1.1250, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:44:14,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 18:44:14,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.25 | bwd_microstep: 815.33 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 814.23 | step_microstep: 3.69 [2025-11-06 18:44:14,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.94 | bwd: 816.30 | bwd_inner: 1.89 | bwd_allreduce: 814.27 | step: 3.77 69%|██████▉ | 2418/3507 [59:28<30:51, 1.70s/it] {'loss': 0.5416, 'learning_rate': 4.647036615879434e-06, 'epoch': 0.69} 69%|██████▉ | 2418/3507 [59:28<30:51, 1.70s/it]tensor([[-4.6250, -5.0938, -2.0156, 2.6719, -1.6641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:44:14,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.24 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.1719, -2.7656, -0.0811, 4.5938, 0.2852]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -4.8750, -1.6797, 2.2500, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -2.3281, 1.7188, 0.6094, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.3750, -4.1250, 0.8164, 0.6836, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3750, -4.9375, -0.9102, 2.4062, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9062, -4.6250, -0.0505, 2.1094, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6875, -4.3438, 0.5117, 2.4375, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:44:15,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 14.55 | optimizer_step: 0.28 [2025-11-06 18:44:15,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.09 | bwd_microstep: 53.66 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 52.67 | step_microstep: 17.49 [2025-11-06 18:44:15,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.36 | bwd: 54.44 | bwd_inner: 1.54 | bwd_allreduce: 52.72 | step: 17.58 69%|██████▉ | 2419/3507 [59:29<24:02, 1.33s/it] {'loss': 0.1645, 'learning_rate': 4.6392365154127735e-06, 'epoch': 0.69} 69%|██████▉ | 2419/3507 [59:29<24:02, 1.33s/it]tensor([[-4.2188, -2.0156, 2.1406, 1.8750, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0000, 0.6289, 2.9375, -0.7695, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0000e+00, -3.9844e+00, -4.4861e-03, 2.1875e+00, -2.7969e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 
18:44:15,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.77 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-4.2812, -4.2188, -0.5469, 3.0938, -1.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.4688, -3.4375, 0.2969, -1.7344, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.5156, 0.5781, 1.7891, -1.4844, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-7.0000, -2.7500, 2.3438, -1.9062, -6.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9531, -0.2812, 2.2812, -1.6953, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:44:17,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 18:44:17,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.07 | bwd_microstep: 179.85 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 178.79 | step_microstep: 2.08 [2025-11-06 18:44:17,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 440.88 | bwd: 180.76 | bwd_inner: 1.73 | bwd_allreduce: 178.86 | step: 2.18 69%|██████▉ | 2420/3507 [59:30<26:42, 1.47s/it] {'loss': 0.3623, 'learning_rate': 4.631440989124496e-06, 'epoch': 0.69} 69%|██████▉ | 2420/3507 [59:30<26:42, 1.47s/it]tensor([[-5.9688, -5.4375, -1.4922, 1.5781, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:44:17,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.62 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.5625, -4.5625, -1.3047, 2.2656, -1.9609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:1') tensor([[-10.0000, -8.7500, -3.1406, -0.1816, -6.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6250, -4.0312, 0.6289, 1.9688, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7656, -4.5000, -2.2344, 2.3906, -0.9648]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.8438, -5.1562, -0.8477, 2.0156, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9688, -3.4531, 1.3438, 3.0938, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5312e+00, -3.2656e+00, 6.4062e-01, 6.0272e-04, -4.1250e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:44:18,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.21 | optimizer_step: 0.31 [2025-11-06 18:44:18,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 137.02 | bwd_microstep: 1344.87 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 1343.69 | step_microstep: 2.36 [2025-11-06 18:44:18,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.67 | bwd: 1345.68 | bwd_inner: 1.79 | bwd_allreduce: 1343.74 | step: 2.45 69%|██████▉ | 2421/3507 [59:32<28:28, 1.57s/it] {'loss': 0.138, 'learning_rate': 4.623650043666293e-06, 'epoch': 0.69} 69%|██████▉ | 2421/3507 [59:32<28:28, 1.57s/it]tensor([[-2.7969, 0.6211, 2.7188, -1.4609, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -1.1875, 2.2500, -0.6445, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5625, -3.7656, 1.4297, 2.7812, -3.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:44:19,068] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.09 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.0625, -4.5625, -0.4805, 2.8750, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.6875, -5.8438, -0.7070, 2.4219, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.0938, -2.7656, 1.2500, 2.6406, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.1562, -0.2119, 3.5938, -0.6953, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-8.3125, -6.8125, -1.6484, 0.2383, -5.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:44:19,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 18:44:19,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.34 | bwd_microstep: 41.40 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 40.29 | step_microstep: 1.53
[2025-11-06 18:44:19,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.46 | bwd: 42.42 | bwd_inner: 1.97 | bwd_allreduce: 40.33 | step: 1.62
69%|██████▉ | 2422/3507 [59:33<22:24, 1.24s/it] {'loss': 0.3549, 'learning_rate': 4.615863685685936e-06, 'epoch': 0.69}
tensor([[-4.0938, -0.5430, 2.8906, -0.2178, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:44:19,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.92 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.5000, -0.6445, 3.5156, -0.2812, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.7188, 1.5469, 3.4844, -2.2344, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
tensor([[-6.4062, -6.1562, -2.5312, 0.9883, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.4375, -2.2969, 2.2188, -0.2773, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.2500, -2.2344, 1.5000, 1.1484, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.4688, -2.3750, 2.5156, 0.3379, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.2812, -4.6562, 0.4004, 3.8125, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:44:21,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 18:44:21,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.80 | bwd_microstep: 1707.03 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1705.76 | step_microstep: 2.19
[2025-11-06 18:44:21,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 282.74 | bwd: 1707.89 | bwd_inner: 1.96 | bwd_allreduce: 1705.80 | step: 2.26
69%|██████▉ | 2423/3507 [59:35<26:38, 1.47s/it] {'loss': 0.3786, 'learning_rate': 4.608081921827303e-06, 'epoch': 0.69}
tensor([[-3.9062, -1.3047, 1.9688, 0.1436, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.3438, -4.5000, 0.2031, 2.9844, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.2188, -4.5938, 1.1484, 2.9375, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.8750, -0.3320, 3.3125, -0.2637, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.0000, 0.1328, 3.1094, 0.4258, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.3438, -1.3438, 2.1875, -0.3418, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:44:22,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.84 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.2188, -3.0469, 1.7734, 4.0625, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-2.0625, 1.2812, 3.1406, -0.8711, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:44:23,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.79 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:44:23,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.73 | bwd_microstep: 756.35 | bwd_inner_microstep: 2.14 | bwd_allreduce_microstep: 754.00 | step_microstep: 4.22
[2025-11-06 18:44:23,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 429.65 | bwd: 757.27 | bwd_inner: 3.02 | bwd_allreduce: 754.01 | step: 4.29
69%|██████▉ | 2424/3507 [59:36<27:45, 1.54s/it] {'loss': 0.1193, 'learning_rate': 4.6003047587303376e-06, 'epoch': 0.69}
tensor([[-4.2188, -1.8438, 1.5000, 0.5625, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:44:23,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.30 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.2812, -2.2188, 1.3906, 1.1641, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.3750, -0.6055, 2.5469, -1.4609, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.6250, -3.4219, 1.8203, -0.2324, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.2969, -3.9062, -1.6562, 2.5938, -0.7422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.2188, -1.0312, 3.5312, -0.7656, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.0312, -4.0312, -2.5781, 1.9453, -0.4863]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.8750, -3.3750, 1.6719, -0.9414, -5.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:44:25,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 2.04 | optimizer_gradients: 0.20 | optimizer_step: 0.18
[2025-11-06 18:44:25,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.75 | bwd_microstep: 1312.35 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 1311.19 | step_microstep: 3.75
[2025-11-06 18:44:25,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.06 | bwd: 1313.14 | bwd_inner: 1.73 | bwd_allreduce: 1311.25 | step: 3.84
69%|██████▉ | 2425/3507 [59:38<30:10, 1.67s/it] {'loss': 0.1771, 'learning_rate': 4.592532203031047e-06, 'epoch': 0.69}
tensor([[-4.0312, -4.0625, -0.7500, 3.0156, -1.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.5625, -3.7969, -0.0933, 2.3125, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.2188, 0.1021, 3.1094, -2.4688, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.1875, -4.6250, -2.0312, 2.0938, -1.5703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:25,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.69 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.05 | step_microstep: 4.49
tensor([[-3.7812, -3.7031, -0.7734, 2.4531, -1.5234]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.4062, -1.7734, 2.8594, -0.5781, -4.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.2812, -4.6875, -1.8125, 2.4375, -1.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-1.8672, -2.7344, -2.1094, 1.4922, 0.1318]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:44:26,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.30 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:44:26,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.98 | bwd_microstep: 185.93 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 184.98 | step_microstep: 3.73
[2025-11-06 18:44:26,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.67 | bwd: 186.94 | bwd_inner: 1.74 | bwd_allreduce: 185.05 | step: 8.22
69%|██████▉ | 2426/3507 [59:40<28:29, 1.58s/it] {'loss': 0.0817, 'learning_rate': 4.584764261361532e-06, 'epoch': 0.69}
tensor([[-3.5312, -1.2656, 2.9688, 2.2500, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.3125, -6.1562, -3.0469, 0.1025, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:2')
tensor([[-4.6250, -0.1611, 3.5938, -1.7188, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.1562, -2.1250, 3.5938, -0.1680, -5.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:44:26,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.43 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.1250, -0.5742, 2.7188, -0.9883, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-7.3438, -5.1250, 0.9492, 1.4688, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-6.1875, -5.4688, -0.9102, 2.1719, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.3438, -6.3125, -2.0000, 2.2344, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:44:28,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.24 | optimizer_gradients: 0.20 | optimizer_step: 0.21
[2025-11-06 18:44:28,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 294.81 | bwd_microstep: 1629.94 | bwd_inner_microstep: 1.75 | bwd_allreduce_microstep: 1628.03 | step_microstep: 3.71
[2025-11-06 18:44:28,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 582.31 | bwd: 1630.66 | bwd_inner: 2.38 | bwd_allreduce: 1628.08 | step: 3.80
69%|██████▉ | 2427/3507 [59:42<32:13, 1.79s/it] {'loss': 1.0217, 'learning_rate': 4.577000940349939e-06, 'epoch': 0.69}
tensor([[-7.1562, -4.5000, 1.8047, 1.6875, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:28,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.04 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.6875, -2.9375, 1.9531, 0.7188, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.2188, -4.6250, -1.6328, 2.6094, -1.5391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-1.9375, 2.0000, 2.6094, -3.1562, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.9062, -2.9375, -0.5117, 2.6406, -0.8164]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-2.5625, 0.8516, 2.3125, -1.5078, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-3.9375, -2.0781, 1.1719, 1.0703, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.4688, -5.3750, -0.0240, 2.6406, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:44:29,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.15 | optimizer_gradients: 0.19 | optimizer_step: 0.28
[2025-11-06 18:44:29,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.96 | bwd_microstep: 851.63 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 850.61 | step_microstep: 3.70
[2025-11-06 18:44:29,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.01 | bwd: 852.40 | bwd_inner: 1.58 | bwd_allreduce: 850.66 | step: 3.78
69%|██████▉ | 2428/3507 [59:43<29:19, 1.63s/it] {'loss': 0.5144, 'learning_rate': 4.569242246620477e-06, 'epoch': 0.69}
tensor([[-6.8125, -4.3125, 0.7734, 0.3164, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.5781, -4.1562, -1.9922, 2.1562, -1.0547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:30,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 140.13 | bwd_microstep: 1.18 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-7.0938, -4.7500, 1.5156, 2.0781, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-1.6641, 1.6406, 4.8438, 1.2266, -1.9922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.4531, -3.3750, -3.0469, 0.6875, -0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:0')
tensor([[-5.5312, -4.1250, 1.3594, 3.4062, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.7812, -4.5625, 0.1377, 2.2969, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.5938, -4.1875, 0.1729, 1.6406, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:44:31,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.24
[2025-11-06 18:44:31,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.15 | bwd_microstep: 950.95 | bwd_inner_microstep: 5.31 | bwd_allreduce_microstep: 945.53 | step_microstep: 2.28
[2025-11-06 18:44:31,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.30 | bwd: 952.11 | bwd_inner: 6.32 | bwd_allreduce: 945.59 | step: 2.38
69%|██████▉ | 2429/3507 [59:45<27:51, 1.55s/it] {'loss': 0.3582, 'learning_rate': 4.561488186793407e-06, 'epoch': 0.69}
tensor([[-3.8594, -0.5781, 2.7500, -0.5977, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.4688, -3.2656, 0.0684, 1.2734, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.2188, -5.9375, -0.9883, 3.3750, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.9375, -2.5781, 2.7969, 0.3242, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.8125, -0.8516, 3.0625, 0.7148, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.6250, -3.1250, 1.8672, 1.0234, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.9688, -3.4531, 2.5000, 0.0439, -5.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:31,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.21 | bwd_microstep: 1.25 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10
tensor([[-2.8750, 1.2266, 2.4844, -3.0625, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:44:32,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 18:44:32,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.38 | bwd_microstep: 1.68 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.56
[2025-11-06 18:44:32,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 426.62 | bwd: 2.93 | bwd_inner: 1.91 | bwd_allreduce: 0.86 | step: 2.66
69%|██████▉ | 2430/3507 [59:46<27:28, 1.53s/it] {'loss': 0.5733, 'learning_rate': 4.553738767485034e-06, 'epoch': 0.69}
tensor([[-7.8750, -4.9375, 0.8086, 0.1045, -5.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.7031, 0.2637, 2.7031, 0.1924, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.3125, -4.3438, -1.1484, 2.5000, -1.8203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:32,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.53 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-3.4844, -0.0442, 2.8125, -1.1172, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.6562, -3.7969, 0.6445, 3.3906, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.6875, -3.9844, 1.5234, 0.6953, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.0000, -3.7812, -2.5312, 1.6484, -0.5547]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:2')
tensor([[-0.6367, 3.1094, 3.4531, -1.8516, -1.9141]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:44:34,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.20 | optimizer_step: 0.19
[2025-11-06 18:44:34,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 117.07 | bwd_microstep: 1450.10 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1449.00 | step_microstep: 7.39
[2025-11-06 18:44:34,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.62 | bwd: 1450.80 | bwd_inner: 1.60 | bwd_allreduce: 1449.05 | step: 7.48
69%|██████▉ | 2431/3507 [59:48<28:56, 1.61s/it] {'loss': 1.0937, 'learning_rate': 4.545993995307705e-06, 'epoch': 0.69}
tensor([[-3.9375, -3.2188, -0.0908, 2.1250, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.0000, -5.4375, -3.2031, 0.6875, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.3125, 1.3594, 2.6562, -1.3359, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.6250, -3.8750, -0.3262, 2.1719, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.8438, -1.1953, 2.7344, -0.8125, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.2969, 2.2656, 3.3906, -2.6719, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.1719, -3.6094, -1.2734, 2.7969, -0.7070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:0')
[2025-11-06 18:44:35,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.33 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.1562, -3.8438, -0.2559, 2.9844, -1.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:37,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:44:37,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.51 | bwd_microstep: 2.42 | bwd_inner_microstep: 1.51 | bwd_allreduce_microstep: 0.83 | step_microstep: 1.85
[2025-11-06 18:44:37,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.82 | bwd: 3.41 | bwd_inner: 2.42 | bwd_allreduce: 0.86 | step: 1.93
69%|██████▉ | 2432/3507 [59:51<34:58, 1.95s/it] {'loss': 0.5608, 'learning_rate': 4.538253876869801e-06, 'epoch': 0.69}
tensor([[-4.6562, -1.4375, 1.6094, -1.5391, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.7188, -1.7031, 2.4219, 0.5312, -3.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.6406, -3.9844, -1.4844, 2.1406, -1.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:37,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.64 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-8.1875, -4.9688, 1.1172, -0.5898, -6.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.5781, -0.4316, 3.0625, 0.1187, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-0.6836, 3.2344, 4.1875, -0.9844, -1.8516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.0000, -4.0312, -0.3945, 1.4922, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-0.1602, 2.4688, 1.9062, -1.1641, -0.8828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:44:37,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:44:37,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.31 | bwd_microstep: 46.69 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 45.50 | step_microstep: 1.95
[2025-11-06 18:44:37,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 444.98 | bwd: 47.51 | bwd_inner: 1.81 | bwd_allreduce: 45.55 | step: 2.04
69%|██████▉ | 2433/3507 [59:51<27:19, 1.53s/it] {'loss': 0.6618, 'learning_rate': 4.530518418775734e-06, 'epoch': 0.69}
tensor([[-5.0000, -3.7812, 0.3223, 1.8594, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.3125, -4.1875, -0.3555, -0.3086, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-7.3438, -6.1562, -1.0312, 1.5859, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-8.0000, -4.7812, 0.7812, -0.9844, -6.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:44:38,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.92 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.0938, -0.7148, 1.9062, 0.1147, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.8438, -0.7891, 3.1250, -1.2656, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.9844, -3.4062, -0.1494, 2.3125, -1.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-2.4688, -3.1094, -1.9922, 1.5625, -0.3359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:3')
[2025-11-06 18:44:38,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:44:38,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.99 | bwd_microstep: 260.71 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 259.72 | step_microstep: 2.34
[2025-11-06 18:44:38,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.92 | bwd: 261.57 | bwd_inner: 1.69 | bwd_allreduce: 259.76 | step: 2.41
69%|██████▉ | 2434/3507 [59:52<22:56, 1.28s/it] {'loss': 0.638, 'learning_rate': 4.522787627625932e-06, 'epoch': 0.69}
tensor([[-5.7500, -3.6094, 1.6094, 1.9766, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:38,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.66 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-8.9375, -6.0938, -0.1357, -0.8633, -6.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.7812, -3.8906, 0.1885, 2.7031, -2.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.1562, -4.2500, -0.6797, 1.4609, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-1.6328, 0.7227, 1.9062, -0.0400, -1.6016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-2.8281, 0.4609, 3.0469, -0.3281, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.5000, -3.8125, -0.0410, 2.5312, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.5000, -4.8125, -0.9922, 1.5078, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:44:42,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.21 | optimizer_step: 0.34
[2025-11-06 18:44:42,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.52 | bwd_microstep: 3189.86 | bwd_inner_microstep: 1.45 | bwd_allreduce_microstep: 3188.30 | step_microstep: 2.78
[2025-11-06 18:44:42,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.19 | bwd: 3190.92 | bwd_inner: 2.42 | bwd_allreduce: 3188.36 | step: 2.86
69%|██████▉ | 2435/3507 [59:55<35:06, 1.97s/it] {'loss': 0.377, 'learning_rate': 4.515061510016859e-06, 'epoch': 0.69}
tensor([[-8.1875, -6.0312, -0.9219, -0.3594, -5.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:42,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 120.42 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.2188, -5.5625, -1.1797, 1.9531, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.3438, -4.9375, -0.5859, 3.0000, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.2188, -4.1250, 0.2539, 2.5625, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.6094, -3.5000, -2.0156, 2.5938, -0.1064]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:0')
tensor([[-2.4531, -3.1406, -1.3984, 2.5469, -0.2061]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.4062, -2.0625, 0.9141, -0.0825, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.7969, -3.6094, -2.5938, 1.3438, -0.4785]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:44:42,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:44:42,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.10 | bwd_microstep: 130.15 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 129.10 | step_microstep: 1.87
[2025-11-06 18:44:42,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.54 | bwd: 131.14 | bwd_inner: 1.86 | bwd_allreduce: 129.15 | step: 1.96
69%|██████▉ | 2436/3507 [59:56<27:07, 1.52s/it] {'loss': 0.507, 'learning_rate': 4.507340072540969e-06, 'epoch': 0.69}
tensor([[-6.1562, -5.3750, -0.3398, 2.8125, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:42,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.79 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-6.2812, -5.9688, -1.0234, 3.1406, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.8125, -3.7500, 0.0211, 1.6328, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.2188, -0.7695, 3.7344, -1.6562, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.1562, -0.5469, 2.6875, -1.1094, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.0625, -1.3750, 2.5312, 1.2188, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.3125, -3.5312, 1.1953, 2.0781, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.9375, -5.1250, -1.6406, 2.4219, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:44:45,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:44:45,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.05 | bwd_microstep: 2305.77 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 2304.68 | step_microstep: 1.96
[2025-11-06 18:44:45,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.87 | bwd: 2306.53 | bwd_inner: 1.69 | bwd_allreduce: 2304.71 | step: 2.03
69%|██████▉ | 2437/3507 [59:59<33:27, 1.88s/it] {'loss': 0.1218, 'learning_rate': 4.499623321786735e-06, 'epoch': 0.69}
tensor([[-4.7812, -1.2969, 2.7031, -0.4434, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.8125, -5.8438, -1.8438, 2.3281, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:45,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.16 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.4375, -2.7656, 1.4062, 0.2373, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.8125, -5.0000, -1.1250, 3.5000, -1.8203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.2500, -4.2500, 0.5195, 0.8008, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.1562, -5.9688, -1.6406, 2.5156, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.0625, -5.0625, -0.7812, 1.5547, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.1250, -5.2188, -1.9922, 1.7656, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:44:45,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:44:45,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.61 | bwd_microstep: 43.86 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 42.71 | step_microstep: 1.38
[2025-11-06 18:44:45,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.79 | bwd: 44.83 | bwd_inner: 1.96 | bwd_allreduce: 42.75 | step: 1.47
70%|██████▉ | 2438/3507 [59:59<25:48, 1.45s/it] {'loss': 0.7504, 'learning_rate': 4.491911264338625e-06, 'epoch': 0.7}
tensor([[-5.5000, -4.6562, -0.2930, 2.3906, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.2812, -3.1875, 1.1328, 1.3594, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-1.7422, -2.1562, -1.5703, 0.9766, -0.1074]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:0')
[2025-11-06 18:44:45,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.94 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.3594, 1.2188, 2.6250, -1.3984, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-3.3281, 0.7969, 3.0000, -2.2969, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.6562, -6.0625, -1.5938, 1.6875, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-7.6562, -4.1875, 1.6875, -0.2578, -6.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-1.6484, -1.9297, -0.8984, 1.9688, 0.0615]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:44:48,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.19 | optimizer_step: 0.21
[2025-11-06 18:44:48,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.48 | bwd_microstep: 2184.20 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 2183.10 | step_microstep: 2.05
[2025-11-06 18:44:48,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 498.44 | bwd: 2185.00 | bwd_inner: 1.72 | bwd_allreduce: 2183.15 | step: 2.12
70%|██████▉ | 2439/3507 [1:00:02<32:36, 1.83s/it] {'loss': 0.5426, 'learning_rate': 4.484203906777112e-06, 'epoch': 0.7}
tensor([[-3.2031, -3.0000, 0.5117, 4.0312, -0.9180]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.5938, -5.4375, -0.6250, 1.9922, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:48,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.57 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.0312, -2.3594, 0.6250, -1.0312, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.8750, -4.6875, -0.6250, 3.4062, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.9375, -2.1406, 1.4297, 1.6016, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.9375, -3.7969, 0.5469, 0.3086, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.7188, -4.7188, 0.9297, 1.9531, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.6562, -5.4375, -0.4961, 1.8125, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:44:48,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:44:48,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.85 | bwd_microstep: 60.37 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 59.20 | step_microstep: 1.33
[2025-11-06 18:44:48,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.45 | bwd: 61.21 | bwd_inner: 1.86 | bwd_allreduce: 59.23 | step: 1.39
70%|██████▉ | 2440/3507 [1:00:02<25:08, 1.41s/it] {'loss': 0.2788, 'learning_rate': 4.47650125567865e-06, 'epoch': 0.7}
tensor([[-5.0312, -3.9531, -0.1025, 1.6953, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.5156, 0.6016, 2.6094, -0.3906, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.6875, -4.1250, -0.8672, 1.4922, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 18:44:49,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.13 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-3.9375, -0.4258, 1.7422, -2.1094, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.2812, -2.9375, 1.6016, 1.2734, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.2969, 1.0547, 3.9531, -1.6875, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.5938, -1.6484, -0.1001, -4.5625, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([0], device='cuda:0')
tensor([[-4.8750, -4.0000, -0.1089, 2.3125, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:44:51,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.32
[2025-11-06 18:44:51,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.25 | bwd_microstep: 2662.78 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 2661.72 | step_microstep: 2.48
[2025-11-06 18:44:51,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.40 | bwd: 2663.57 | bwd_inner: 1.68 | bwd_allreduce: 2661.77 | step: 2.55
70%|██████▉ | 2441/3507 [1:00:05<33:53, 1.91s/it] {'loss': 0.8723, 'learning_rate': 4.468803317615681e-06, 'epoch': 0.7}
tensor([[-1.2109, 2.1094, 2.4219, -1.6875, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-1.6328, 2.4531, 3.6094, -1.9844, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.5938, -1.4062, 2.6094, 0.2695, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:44:52,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.12 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.2500, -1.3359, 2.0156, -0.3086, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-7.9375, -3.4688, -0.6992, -5.9688, -7.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.1250, -3.6406, 0.7695, 1.8750, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.3438, -3.6719, 0.8828, 2.0781, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.5625, -3.1562, 2.0312, -0.3730, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:44:52,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:44:52,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.67 | bwd_microstep: 1.73 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.70 | step_microstep: 1.56
[2025-11-06 18:44:52,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.83 | bwd: 2.60 | bwd_inner: 1.75 | bwd_allreduce: 0.73 | step: 1.63
70%|██████▉ | 2442/3507 [1:00:06<25:48, 1.45s/it] {'loss': 0.5582, 'learning_rate': 4.461110099156624e-06, 'epoch': 0.7}
tensor([[-5.5625, -3.0781, 1.0703, 0.6367, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.1250, -3.0469, 2.2344, 0.6406, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.5000, -3.7656, 0.2246, 3.0156, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.2500, -3.5625, 1.1953, 2.2500, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:44:52,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.24 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.9688, -5.4062, -0.9453, 2.1562, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.1875, -5.3438, -1.8203, 2.3125, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.1562, -4.7188, -0.7305, 2.5312, -2.5625]], device='cuda:0',
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.8438, -3.8125, 0.4844, 0.5195, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:44:54,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:44:54,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.87 | bwd_microstep: 1559.99 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 1559.16 | step_microstep: 1.77 [2025-11-06 18:44:54,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.13 | bwd: 1560.76 | bwd_inner: 1.45 | bwd_allreduce: 1559.19 | step: 1.84 70%|██████▉ | 2443/3507 [1:00:08<28:25, 1.60s/it] {'loss': 0.2396, 'learning_rate': 4.453421606865869e-06, 'epoch': 0.7} 70%|██████▉ | 2443/3507 [1:00:08<28:25, 1.60s/it]tensor([[-3.5938, -4.3125, -2.2812, 2.0469, -1.0547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-4.9688, -2.5156, 1.4219, 0.5039, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0312, -0.8594, 2.4688, -0.4121, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:44:54,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.48 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.4062, -0.2598, 4.1250, 1.7188, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8750, -1.0469, 2.4375, 0.6562, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9688, -3.3281, 2.4062, 1.8750, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.1562, -4.5625, 0.3066, -0.1592, -5.3125]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -3.0156, 1.6094, 2.2812, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:44:54,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:44:54,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.33 | bwd_microstep: 1.56 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.68 | step_microstep: 1.39 [2025-11-06 18:44:54,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.83 | bwd: 2.40 | bwd_inner: 1.57 | bwd_allreduce: 0.72 | step: 1.47 70%|██████▉ | 2444/3507 [1:00:08<22:00, 1.24s/it] {'loss': 0.9396, 'learning_rate': 4.445737847303776e-06, 'epoch': 0.7} 70%|██████▉ | 2444/3507 [1:00:08<22:00, 1.24s/it]tensor([[-5.1250, -3.9688, 0.1235, 1.8047, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:44:54,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.73 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.0625, -3.6875, 0.2422, 1.6953, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2812, -2.3438, 1.1016, 1.4219, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4062, -2.9531, 2.6250, 0.4707, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.0625, -4.7188, 1.4922, 2.2188, -4.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.9062, -4.7188, -0.4844, -0.6875, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -3.6875, 1.1016, 3.7500, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-7.9062, -4.4375, 0.9336, -1.1406, -6.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:44:56,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:44:56,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.99 | bwd_microstep: 1640.63 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 1639.63 | step_microstep: 1.74
[2025-11-06 18:44:56,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.75 | bwd: 1641.45 | bwd_inner: 1.66 | bwd_allreduce: 1639.67 | step: 1.81
70%|██████▉ | 2445/3507 [1:00:10<25:50, 1.46s/it] {'loss': 0.4423, 'learning_rate': 4.438058827026667e-06, 'epoch': 0.7}
tensor([[-4.7812, -1.2031, 2.0625, -1.5078, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.8594, -0.4570, 1.9219, 0.5078, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:44:56,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.95 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-0.0737, 3.5156, 3.2656, -1.5234, -1.3516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0938, -4.3750, -0.3906, 2.3438, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-8.8125, -7.0000, -0.5977, 1.2188, -5.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8438, -0.3438, 3.2969, 0.1855, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8750, -2.2500, 2.1719, 1.4375, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.6250, -2.8906, 0.7500, -1.0859, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:44:57,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:44:57,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.67 | bwd_microstep: 2.01 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.82 | step_microstep: 1.52
[2025-11-06 18:44:57,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.65 | bwd: 2.82 | bwd_inner: 1.80 | bwd_allreduce: 0.86 | step: 1.62
70%|██████▉ | 2446/3507 [1:00:11<21:11, 1.20s/it] {'loss': 0.4914, 'learning_rate': 4.430384552586819e-06, 'epoch': 0.7}
tensor([[-2.8438, 0.9336, 2.0938, -2.5312, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.0938, -3.7188, -1.1719, 3.2969, -0.4980]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5938, -5.3750, -3.3750, 1.1875, -1.6953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.6250, -4.8125, -1.0391, 1.2734, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:44:57,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.95 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.0312, 2.4531, 3.1406, -3.0156, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-4.7812, -0.9727, 1.8672, -2.5469, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-4.6250, -5.0938, -1.8438, 2.9219, -1.6484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.8125, -4.6875, -1.7266, 1.7188, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:45:00,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.27 | optimizer_step: 0.36
[2025-11-06 18:45:00,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.98 | bwd_microstep: 2538.45 | bwd_inner_microstep: 1.34 | bwd_allreduce_microstep: 2537.00 | step_microstep: 2.79
[2025-11-06 18:45:00,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 428.97 | bwd: 2539.36 | bwd_inner: 2.17 | bwd_allreduce: 2537.05 | step: 2.86
70%|██████▉ | 2447/3507 [1:00:14<30:46, 1.74s/it] {'loss': 0.717, 'learning_rate': 4.422715030532461e-06, 'epoch': 0.7}
tensor([[-5.3438, -5.8438, -3.8125, -0.0120, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:45:00,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 105.31 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.6875, -2.4219, 1.9141, 1.6953, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9062, -3.5625, 0.2695, 1.6328, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.5312, -5.5000, 0.3516, 1.3672, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.4688, -5.2188, 0.6094, 1.0625, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4375, -0.9727, 2.0625, -1.5625, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5000, -0.2969, 3.2344, -1.5703, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.5312, -1.4844, 3.5781, -0.5938, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:45:01,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:45:01,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.79 | bwd_microstep: 461.13 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 460.19 | step_microstep: 1.96
[2025-11-06 18:45:01,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 229.11 | bwd: 462.15 | bwd_inner: 1.81 | bwd_allreduce: 460.22 | step: 2.04
70%|██████▉ | 2448/3507 [1:00:14<25:20, 1.44s/it] {'loss': 0.4051, 'learning_rate': 4.415050267407762e-06, 'epoch': 0.7}
tensor([[-4.4062, -4.6875, -2.1250, 1.4453, -1.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:45:01,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.05 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.5938, -4.5312, -0.6602, 3.4531, -1.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.4375, -3.3750, 0.8438, 0.7422, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.8125, -5.6250, -1.9453, 1.8750, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.3125, -3.0000, -1.8359, 1.9062, -0.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)tensor([[-4.3438, -3.7031, -0.3770, 2.0938, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([4], device='cuda:1')
tensor([[-4.8125, -3.0156, 0.8945, 1.1719, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2812, -4.0938, -0.5781, 2.8594, -1.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:45:02,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.22
[2025-11-06 18:45:02,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.78 | bwd_microstep: 1237.47 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1236.27 | step_microstep: 2.04
[2025-11-06 18:45:02,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.85 | bwd: 1238.46 | bwd_inner: 2.00 | bwd_allreduce: 1236.31 | step: 2.13
70%|██████▉ | 2449/3507 [1:00:16<26:16, 1.49s/it] {'loss': 0.882, 'learning_rate': 4.407390269752838e-06, 'epoch': 0.7}
tensor([[-8.2500, -6.9688, -1.7344, 0.9102, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7812, -3.0469, 0.8789, 1.6484, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.6562, 1.2109, 2.0938, -2.7812, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.8438, -5.2500, -1.0625, 2.0625, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:45:02,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.65 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.5508, 2.9062, 2.4219, -2.0156, -1.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.9688, -2.7188, 2.9688, 1.2109, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5625, -0.9414, 0.5547, -3.3906, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.6328, 1.6250, 2.3281, -1.6797, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:45:03,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:45:03,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 119.97 | bwd_microstep: 1.70 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 1.47
[2025-11-06 18:45:03,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.63 | bwd: 2.60 | bwd_inner: 1.66 | bwd_allreduce: 0.80 | step: 1.56
70%|██████▉ | 2450/3507 [1:00:16<20:29, 1.16s/it] {'loss': 0.5345, 'learning_rate': 4.39973504410373e-06, 'epoch': 0.7}
tensor([[-6.0938, -2.1719, 1.6016, -2.2031, -5.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.1875, -4.4062, -0.3535, 2.2344, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5625, -2.8906, 0.5508, -0.7266, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.5625, -0.7188, 4.2812, -1.5312, -5.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:45:03,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 325.14 | bwd_microstep: 1.68 | bwd_inner_microstep: 1.45 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.09
tensor([[-5.7500, -2.7188, 2.3438, 0.8398, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7344, 0.3027, 3.8281, -0.2520, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-2.8438, 1.2109, 3.8906, -0.9727, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([2], device='cuda:2')
tensor([[-5.6562, -1.1797, 3.2500, -1.8750, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:45:03,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.24 | optimizer_step: 0.20
[2025-11-06 18:45:03,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 121.68 | bwd_microstep: 61.81 | bwd_inner_microstep: 2.44 | bwd_allreduce_microstep: 59.15 | step_microstep: 2.93
[2025-11-06 18:45:03,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 446.99 | bwd: 63.41 | bwd_inner: 3.87 | bwd_allreduce: 59.21 | step: 3.03
70%|██████▉ | 2451/3507 [1:00:17<17:18, 1.02it/s] {'loss': 0.5697, 'learning_rate': 4.392084596992419e-06, 'epoch': 0.7}
tensor([[-6.7188, -4.3438, 0.3809, 0.2969, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-8.3750, -5.9688, 0.1260, 0.2070, -6.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.2188, -3.9688, -0.5586, 0.5781, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:45:03,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.18 | bwd_microstep: 1.85 | bwd_inner_microstep: 1.50 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10
tensor([[-4.3438, -4.6250, -1.5938, 2.6406, -1.5547]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5312, -4.4375, -0.2852, 1.9219, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.9375, -3.8125, 0.7656, 1.5391, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0000, -4.1250, 0.1797, 2.7031, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5156, -0.6367, 2.4062, -0.1943, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:45:05,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.60 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:45:05,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.74 | bwd_microstep: 1.92 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.85 | step_microstep: 4.20
[2025-11-06 18:45:05,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.98 | bwd: 3.76 | bwd_inner: 2.52 | bwd_allreduce: 0.97 | step: 4.29
70%|██████▉ | 2452/3507 [1:00:18<19:53, 1.13s/it] {'loss': 0.4005, 'learning_rate': 4.384438934946801e-06, 'epoch': 0.7}
tensor([[1.2031, 2.3594, 4.8750, 6.0312, 1.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:45:05,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.90 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.1562, -4.4688, -1.8359, 2.0156, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3125, -2.0625, -0.1523, -3.1094, -4.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:0')
tensor([[-1.3906, 2.0000, 1.7188, -2.5938, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.2500, -3.0781, 2.2031, 0.4844, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.2969, -3.0000, 1.1172, 5.0312, -0.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.5000, -5.7812, -1.5547, 1.4062, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6719, -2.7500, 0.9922, 2.8750, -1.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:45:07,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.23 | optimizer_step: 0.21
[2025-11-06 18:45:07,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.12 | bwd_microstep: 1899.72 | bwd_inner_microstep: 1.84 | bwd_allreduce_microstep: 1897.74 | step_microstep: 3.46
[2025-11-06 18:45:07,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.03 | bwd: 1900.60 | bwd_inner: 2.68 | bwd_allreduce: 1897.76 | step: 3.53
70%|██████▉ | 2453/3507 [1:00:21<25:47, 1.47s/it] {'loss': 1.1076, 'learning_rate': 4.376798064490683e-06, 'epoch': 0.7}
tensor([[-6.3750, -5.9062, -1.7656, 1.9375, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:45:07,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.24 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.4375, -4.8750, -1.3984, 3.2656, -1.4766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.0859, 1.8672, 2.0156, -1.3828, -1.7422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-4.7812, -2.3750, 2.0781, 1.6641, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.0234, 1.9453, 3.2500, 0.3887, -1.3828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.8438, -3.7344, 1.4766, 2.0938, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.1562, -2.6875, -0.4551, 1.8047, -1.3047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2500, -1.2656, 1.7969, -0.5820, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:45:09,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.23 | optimizer_step: 0.21
[2025-11-06 18:45:09,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.02 | bwd_microstep: 2.12 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.71
[2025-11-06 18:45:09,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.29 | bwd: 2.93 | bwd_inner: 1.78 | bwd_allreduce: 0.99 | step: 7.79
70%|██████▉ | 2454/3507 [1:00:22<27:07, 1.55s/it] {'loss': 0.6767, 'learning_rate': 4.36916199214379e-06, 'epoch': 0.7}
tensor([[-4.0312, -0.8633, 3.0625, 0.5117, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.9688, -4.4375, -0.5273, 2.7031, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6875, -3.5156, 0.5117, 2.2812, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:45:09,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.97 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-7.0938, -5.1562, 0.4824, 1.4688, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.6562, -2.9688, 1.9141, 1.1719, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.5938, 2.0469, 1.7812, -0.7422, -1.0234]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-5.0000, -2.9531, 0.8906, 0.8594, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.5938, -1.6719, 2.3750, 0.4629, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:45:09,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.19 | optimizer_step: 0.20
[2025-11-06 18:45:09,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.59 | bwd_microstep: 429.55 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 428.45 | step_microstep: 2.76
[2025-11-06 18:45:09,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.59 | bwd: 430.23 | bwd_inner: 1.57 | bwd_allreduce: 428.49 | step: 2.85
70%|███████ | 2455/3507 [1:00:23<23:34, 1.34s/it] {'loss': 0.4405, 'learning_rate': 4.3615307244217595e-06, 'epoch': 0.7}
tensor([[-4.2812, -0.7773, 1.5703, -1.7266, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.9531, -0.2500, 2.3594, -1.6172, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:45:10,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.87 | bwd_microstep: 1.84 | bwd_inner_microstep: 1.52 | bwd_allreduce_microstep: 0.13 | step_microstep: 0.25
tensor([[-2.2500, 0.3809, 4.2188, 2.7031, -1.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1562, -4.2500, -1.0703, 2.6406, -1.6016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5625, -2.7812, 1.7812, 2.7344, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.2500, -4.4375, 1.7500, 1.2031, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9062, -1.1250, 3.6406, 0.2500, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.3828, 2.4219, 3.0000, -2.4375, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:45:13,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.17 | optimizer_step: 0.24
[2025-11-06 18:45:13,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 120.47 | bwd_microstep: 2.57 | bwd_inner_microstep: 1.47 | bwd_allreduce_microstep: 0.99 | step_microstep: 3.24
[2025-11-06 18:45:13,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.39 | bwd: 4.38 | bwd_inner: 3.00 | bwd_allreduce: 1.11 | step: 3.48
70%|███████ | 2456/3507 [1:00:27<33:52, 1.93s/it] {'loss': 0.657, 'learning_rate': 4.353904267836121e-06, 'epoch': 0.7}
tensor([[-2.7812, 1.4297, 3.3750, -2.0625, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2500, -4.8125, -1.5781, 3.1875, -1.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0312, -4.6875, -2.6875, 1.7344, -1.2422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:45:13,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.48 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.5938, -2.8906, -0.7500, 2.9062, -0.4453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0312, -4.7812, -0.4844, 3.3438, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.9531, -3.6719, -0.2578, 2.9375, -1.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3750, -5.2500, -0.9648, 3.1406, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.8438, -3.9531, 0.1416, 0.3945, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:45:13,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.18 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:45:13,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.82 | bwd_microstep: 17.90 | bwd_inner_microstep: 2.51 | bwd_allreduce_microstep: 15.30 | step_microstep: 8.09
[2025-11-06 18:45:13,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.31 | bwd: 18.56 | bwd_inner: 3.08 | bwd_allreduce: 15.34 | step: 8.17
70%|███████ | 2457/3507 [1:00:27<26:02, 1.49s/it] {'loss': 0.4952, 'learning_rate': 4.3462826288943e-06, 'epoch': 0.7}
tensor([[-3.8125, -4.5000, -2.3594, 1.8438, -1.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0938, -3.8281, -0.1138, 3.3438, -1.5703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:45:13,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.76 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.3125, -3.5156, -0.7344, 1.0781, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2188, -0.5742, 2.7656, -0.8242, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.7188, -4.3125, -0.6406, 2.3594, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0938, -2.3594, 2.9688, -0.3105, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.9688, -1.5312, 1.4375, 0.2207, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4219, 0.3516, 3.6719, -0.2471, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:45:16,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.98 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:45:16,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 278.00 | bwd_microstep: 2.02 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.91 | step_microstep: 5.51
[2025-11-06 18:45:16,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 441.80 | bwd: 2.70 | bwd_inner: 1.59 | bwd_allreduce: 0.95 | step: 5.61
70%|███████ | 2458/3507 [1:00:30<32:09, 1.84s/it] {'loss': 0.4811, 'learning_rate': 4.3386658140996114e-06, 'epoch': 0.7}
tensor([[-2.8750, 0.6133, 2.2969, -1.4375, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:45:16,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.95 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.3750, -0.5820, 3.7031, 0.2354, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.1250, -5.6875, -2.2031, 2.8906, -1.8984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4062, -3.7500, 0.4902, 3.0625, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6094, -4.0625, -2.4219, 0.9609, -1.3047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5625, -3.5781, 0.3652, 2.7031, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.8125, -4.1562, 0.4512, 1.5469, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.4922, 3.2031, 3.2969, -1.6172, -1.7266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:45:17,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:45:17,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.32 | bwd_microstep: 853.77 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 852.76 | step_microstep: 8.89
[2025-11-06 18:45:17,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.29 | bwd: 854.48 | bwd_inner: 1.52 | bwd_allreduce: 852.81 | step: 8.97
70%|███████ | 2459/3507 [1:00:31<29:02, 1.66s/it] {'loss': 0.2048, 'learning_rate': 4.331053829951256e-06, 'epoch': 0.7}
tensor([[-5.0938, -2.7344, 1.3281, 0.8633, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.9375, -1.0703, 2.2656, 0.1738, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:45:17,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.14 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.6875, -3.4688, 0.8555, 2.9062, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4062, -2.3750, 1.3203, 1.1719, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-9.4375, -6.8750, -1.3359, -1.4219, -7.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.2188, -4.5625, 0.7109, 2.0781, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2500, -4.4688, -0.7617, 1.7266, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.4062, 2.0781, 3.6719, -2.4531, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:45:18,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.87 | optimizer_gradients: 0.21 | optimizer_step: 0.21
[2025-11-06 18:45:18,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.65 | bwd_microstep: 2.18 | bwd_inner_microstep: 1.13 |
bwd_allreduce_microstep: 0.94 | step_microstep: 9.84 [2025-11-06 18:45:18,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.81 | bwd: 2.85 | bwd_inner: 1.71 | bwd_allreduce: 0.98 | step: 9.92 70%|███████ | 2460/3507 [1:00:32<23:30, 1.35s/it] {'loss': 0.757, 'learning_rate': 4.323446682944309e-06, 'epoch': 0.7} 70%|███████ | 2460/3507 [1:00:32<23:30, 1.35s/it]tensor([[-3.6094, -2.7344, 0.4414, 2.2344, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:45:18,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 98.87 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-4.3750, -3.5469, 0.1338, 2.1406, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0312, -2.7812, 1.2891, 1.2422, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0625, -5.2188, -1.1953, 3.2500, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -4.4375, -1.0156, 2.2344, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6875, 0.1934, 3.5625, -3.1719, -5.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5312, -4.1875, -0.4648, 2.9375, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2188, -3.7031, 0.5508, 4.0312, -1.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:45:21,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.22 | optimizer_step: 0.24 [2025-11-06 18:45:21,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 52.92 | bwd_microstep: 3290.75 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 3289.60 | 
step_microstep: 2.39 [2025-11-06 18:45:21,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 151.79 | bwd: 3291.67 | bwd_inner: 1.80 | bwd_allreduce: 3289.67 | step: 2.49 70%|███████ | 2461/3507 [1:00:35<34:38, 1.99s/it] {'loss': 0.3974, 'learning_rate': 4.3158443795697215e-06, 'epoch': 0.7} 70%|███████ | 2461/3507 [1:00:35<34:38, 1.99s/it]tensor([[-5.8438, -4.6875, -0.2236, 2.0625, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -2.3125, 1.4375, 0.3105, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5000, -5.2188, -0.9805, 2.6719, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4375, -0.8555, 0.8125, 0.0610, -1.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:45:22,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 311.85 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.6562, -5.2812, -0.9492, 2.3438, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9062, -5.2500, -1.4297, 3.3594, -1.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.9375, -5.0625, 0.7773, 1.9297, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1250, -2.4531, 1.9219, 0.7383, -3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:45:22,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:45:22,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.54 | bwd_microstep: 1.69 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.21 [2025-11-06 
18:45:22,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 483.43 | bwd: 2.53 | bwd_inner: 1.56 | bwd_allreduce: 0.82 | step: 2.29 70%|███████ | 2462/3507 [1:00:36<27:02, 1.55s/it] {'loss': 0.4402, 'learning_rate': 4.308246926314307e-06, 'epoch': 0.7} 70%|███████ | 2462/3507 [1:00:36<27:02, 1.55s/it]tensor([[-0.7461, 2.7969, 3.7031, -0.9688, -1.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0625, -3.9844, 0.1504, 2.2344, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:45:22,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.15 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.9375, -4.3125, -0.3848, 2.4531, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2188, -5.5938, -1.7500, 3.2188, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-8.3750, -6.5312, -1.1797, -0.0400, -5.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.7188, -3.7344, -0.0583, 2.0469, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.8125, -4.6562, 0.5938, 1.1172, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3438, -3.1250, 0.7930, 2.4062, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:45:24,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.22 | optimizer_step: 0.36 [2025-11-06 18:45:24,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.38 | bwd_microstep: 2228.72 | bwd_inner_microstep: 1.29 | bwd_allreduce_microstep: 2227.29 | step_microstep: 2.62 [2025-11-06 18:45:24,810] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.54 | bwd: 2229.65 | bwd_inner: 2.12 | bwd_allreduce: 2227.35 | step: 2.72 70%|███████ | 2463/3507 [1:00:38<32:23, 1.86s/it] {'loss': 0.606, 'learning_rate': 4.300654329660755e-06, 'epoch': 0.7} 70%|███████ | 2463/3507 [1:00:38<32:23, 1.86s/it]tensor([[-10.6250, -7.3125, -2.6562, -4.6562, -8.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:45:24,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.48 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.7812, -2.5625, 1.1641, 0.7148, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2031, 0.3145, 2.3125, -1.6797, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3438, -1.7266, 2.8750, -0.1768, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4062, -4.0938, -0.7695, 2.4531, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5625, -4.5000, -0.3809, 1.8281, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4688, -4.4062, -0.5312, 3.2812, -1.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.4062, -3.0156, 1.1406, 0.3906, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:45:25,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 18:45:25,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.23 | bwd_microstep: 18.73 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 17.57 | step_microstep: 2.28 [2025-11-06 18:45:25,550] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 282.72 | bwd: 19.69 | bwd_inner: 1.94 | bwd_allreduce: 17.61 | step: 2.37 70%|███████ | 2464/3507 [1:00:39<26:30, 1.53s/it] {'loss': 0.1819, 'learning_rate': 4.293066596087587e-06, 'epoch': 0.7} 70%|███████ | 2464/3507 [1:00:39<26:30, 1.53s/it]tensor([[-2.1250, -0.8281, 1.7734, 2.6250, -0.9727]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8750, -3.6406, 0.9219, 0.4824, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3750, -4.3438, -1.3281, 2.3906, -1.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:45:25,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.03 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-8.5000, -7.8438, -3.2344, -0.0117, -5.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -1.1797, 2.2656, -0.0752, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3594, -4.4062, -3.5938, 0.5664, -0.8398]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.0625, 1.0312, 4.7500, -2.1562, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2188, -2.3594, 1.2578, 1.2266, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:45:29,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:45:29,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.62 | bwd_microstep: 3372.65 | bwd_inner_microstep: 15.46 | bwd_allreduce_microstep: 3357.09 | step_microstep: 2.00 [2025-11-06 18:45:29,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.68 | 
bwd: 3373.51 | bwd_inner: 16.24 | bwd_allreduce: 3357.14 | step: 2.09 70%|███████ | 2465/3507 [1:00:43<38:18, 2.21s/it] {'loss': 0.4415, 'learning_rate': 4.2854837320691956e-06, 'epoch': 0.7} 70%|███████ | 2465/3507 [1:00:43<38:18, 2.21s/it]tensor([[-3.7500, 0.1611, 3.3125, -1.2344, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6562, -3.3281, 1.3359, 1.1719, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.9375, -3.8125, 0.6562, -1.5078, -5.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:45:29,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.39 | bwd_microstep: 0.59 | bwd_inner_microstep: 0.50 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.0000, -3.8438, 1.0859, 1.2422, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3438, 0.0967, 2.5000, -3.0000, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3125, -2.6875, 2.6250, 1.7031, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.1562, -2.6875, -0.0947, 0.2559, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -2.8750, 1.2266, 1.4766, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:45:30,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:45:30,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.48 | bwd_microstep: 275.91 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 275.17 | step_microstep: 1.69 [2025-11-06 18:45:30,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.89 | bwd: 276.49 | bwd_inner: 1.17 | 
bwd_allreduce: 275.20 | step: 1.76 70%|███████ | 2466/3507 [1:00:43<30:14, 1.74s/it] {'loss': 0.443, 'learning_rate': 4.277905744075804e-06, 'epoch': 0.7} 70%|███████ | 2466/3507 [1:00:43<30:14, 1.74s/it]tensor([[-1.2969, 0.4785, 0.5898, -0.2012, -1.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-8.0625, -6.0938, -0.0820, 1.0469, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4062, -1.3438, 3.1406, -1.0000, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3750, -2.4375, 2.2188, 0.8945, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:45:30,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.00 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.6562, -4.7188, -0.8164, 3.2969, -1.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -4.4688, -0.6211, 2.2969, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5156, -3.0156, -0.3496, 1.8281, -1.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[0.1104, 1.3047, 3.8750, 4.4062, 0.6953]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:45:30,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:45:30,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.50 | bwd_microstep: 142.37 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 141.64 | step_microstep: 1.96 [2025-11-06 18:45:30,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.52 | bwd: 143.10 | bwd_inner: 1.29 | bwd_allreduce: 141.69 | step: 2.04 
70%|███████ | 2467/3507 [1:00:44<24:04, 1.39s/it] {'loss': 0.954, 'learning_rate': 4.27033263857349e-06, 'epoch': 0.7} 70%|███████ | 2467/3507 [1:00:44<24:04, 1.39s/it]tensor([[-6.5938, -3.9844, 2.2344, 2.0469, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:45:30,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.65 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-1.3281, 1.9766, 3.4844, 0.2266, -1.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-1.3672, 1.8594, 3.5156, -0.0942, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, -1.7969, 2.3594, -0.2715, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3438, 0.1196, 3.0938, -3.1406, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.5625, 1.5625, 3.2969, -2.0000, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7500, -5.0000, -0.6641, 2.1719, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5938, -2.0312, 2.0781, 1.0078, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:45:31,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:45:31,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.40 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.31 [2025-11-06 18:45:31,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.08 | bwd: 2.76 | bwd_inner: 1.71 | bwd_allreduce: 0.91 | step: 2.39 70%|███████ | 2468/3507 [1:00:45<20:06, 
1.16s/it] {'loss': 1.1069, 'learning_rate': 4.262764422024157e-06, 'epoch': 0.7} 70%|███████ | 2468/3507 [1:00:45<20:06, 1.16s/it]tensor([[-4.2500, -0.3379, 2.5000, -1.9531, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6562, -2.7969, 1.0781, 1.5703, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:45:31,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.49 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.7812, -3.7969, -0.0981, 3.7969, -1.1953]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.6250, -4.7188, 1.0391, 2.0000, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.5938, 1.3906, 3.7812, 0.8555, -1.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5625, -5.9062, -2.5781, 1.9297, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7812, -3.1719, 0.8203, 1.4141, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4375, -0.0938, 4.1875, -0.7109, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:45:32,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.27 | optimizer_step: 0.34 [2025-11-06 18:45:32,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.71 | bwd_microstep: 1175.68 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 1174.66 | step_microstep: 2.59 [2025-11-06 18:45:32,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.22 | bwd: 1176.51 | bwd_inner: 1.61 | bwd_allreduce: 1174.72 | step: 2.68 70%|███████ | 2469/3507 [1:00:46<22:10, 1.28s/it] {'loss': 0.4507, 
'learning_rate': 4.255201100885529e-06, 'epoch': 0.7} 70%|███████ | 2469/3507 [1:00:46<22:10, 1.28s/it]tensor([[-5.3438, -4.4688, -0.3828, 2.2344, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2656, -0.4727, 1.8516, -0.9141, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.0469, -1.6328, 1.3047, 4.1250, -0.2168]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:45:33,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.19 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.7188, -3.9062, -0.9375, 2.9062, -1.1953]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4062, -3.7969, 1.8672, 3.6406, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -3.8594, -0.0284, 2.0625, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3438, -0.5039, 3.6406, -0.1699, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3750, -2.4531, 0.9492, 3.0156, -1.5234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:45:35,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:45:35,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.70 | bwd_microstep: 2.72 | bwd_inner_microstep: 1.75 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.19 [2025-11-06 18:45:35,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.91 | bwd: 3.74 | bwd_inner: 2.65 | bwd_allreduce: 0.94 | step: 2.29 70%|███████ | 2470/3507 [1:00:49<30:47, 1.78s/it] {'loss': 0.1013, 'learning_rate': 4.247642681611161e-06, 
'epoch': 0.7} 70%|███████ | 2470/3507 [1:00:49<30:47, 1.78s/it]tensor([[-2.9688, -3.4844, -2.7344, 0.3320, -0.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-7.7188, -4.8438, 1.3438, 0.7188, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7500, -5.2500, -1.0078, 2.3125, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9062, -4.2500, 0.1885, 3.1875, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:45:35,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.16 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.6875, 0.3223, 3.1875, -1.7891, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5938, -1.2734, 1.4141, 0.6055, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.5000, -6.1562, 0.5156, 1.1328, -5.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2188, -3.1562, -0.2520, -1.2031, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:45:36,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 18:45:36,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.87 | bwd_microstep: 31.96 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 30.84 | step_microstep: 1.87 [2025-11-06 18:45:36,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 545.08 | bwd: 33.05 | bwd_inner: 1.94 | bwd_allreduce: 30.90 | step: 1.97 70%|███████ | 2471/3507 [1:00:50<24:48, 1.44s/it] {'loss': 0.6042, 'learning_rate': 4.240089170650433e-06, 'epoch': 0.7} 70%|███████ | 2471/3507 
[1:00:50<24:48, 1.44s/it]tensor([[-7.2188, -6.7188, -1.7422, 2.0625, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0312, -3.0000, -0.1562, 3.0938, -0.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1562, -2.7656, 1.5156, 0.8711, -3.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2188, -2.2969, 0.9023, 2.4375, -1.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5312, -1.4766, 2.8750, 1.1406, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:45:37,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.18 | bwd_microstep: 1.90 | bwd_inner_microstep: 1.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.5938, -4.2812, 0.2178, 2.1562, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.4688, -5.2500, 0.6367, 1.1250, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.3438, 0.0801, 3.9219, -1.7422, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:45:38,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:45:38,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.78 | bwd_microstep: 258.72 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 257.88 | step_microstep: 2.34 [2025-11-06 18:45:38,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 513.00 | bwd: 260.62 | bwd_inner: 2.52 | bwd_allreduce: 257.93 | step: 2.42 70%|███████ | 2472/3507 [1:00:52<27:33, 1.60s/it] {'loss': 0.2755, 'learning_rate': 4.232540574448524e-06, 'epoch': 0.7} 70%|███████ | 2472/3507 [1:00:52<27:33, 
1.60s/it]tensor([[-3.7188, -3.9844, -0.9766, 2.9375, -1.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5781, -2.5000, 0.2412, 1.4844, -1.9453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0000, -3.6406, -1.6641, 2.3594, -0.6055]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6875, -3.9219, 1.6406, 0.7734, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:45:38,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.82 | bwd_microstep: 0.65 | bwd_inner_microstep: 0.54 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.6562, -0.8828, 2.1406, -2.4375, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3125, -2.0156, 1.3594, 0.2471, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1562, -3.4531, 1.2031, 2.4688, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7812, -3.3750, 0.9180, 2.4219, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:45:39,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:45:39,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.97 | bwd_microstep: 649.71 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 648.84 | step_microstep: 1.99 [2025-11-06 18:45:39,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.82 | bwd: 650.36 | bwd_inner: 1.33 | bwd_allreduce: 648.88 | step: 2.08 71%|███████ | 2473/3507 [1:00:53<24:43, 1.43s/it] {'loss': 0.5435, 'learning_rate': 4.224996899446425e-06, 'epoch': 0.71} 71%|███████ | 2473/3507 [1:00:53<24:43, 1.43s/it]tensor([[-4.4375, 
[Interleaved per-rank debug prints elided: at every step each rank (cuda:0–cuda:3) printed a [1, 5] bfloat16 logits tensor and its predicted class index; the grad_fn names inside the angle brackets were lost during extraction (`grad_fn=)`). The [Rank 0] DeepSpeed wall-clock records (fwd_microstep / bwd_microstep / bwd_inner_microstep / bwd_allreduce_microstep / step_microstep and their per-step aggregates, in ms) are likewise elided. Training progress for steps 2474–2493:]
 71%|███████ | 2474/3507 [1:00:54<24:45, 1.44s/it] {'loss': 0.4584, 'learning_rate': 4.217458152080927e-06, 'epoch': 0.71}
 71%|███████ | 2475/3507 [1:00:56<28:10, 1.64s/it] {'loss': 0.9287, 'learning_rate': 4.209924338784617e-06, 'epoch': 0.71}
 71%|███████ | 2476/3507 [1:00:58<30:23, 1.77s/it] {'loss': 0.2728, 'learning_rate': 4.20239546598587e-06, 'epoch': 0.71}
 71%|███████ | 2477/3507 [1:00:59<23:29, 1.37s/it] {'loss': 0.5076, 'learning_rate': 4.194871540108849e-06, 'epoch': 0.71}
 71%|███████ | 2478/3507 [1:01:01<27:29, 1.60s/it] {'loss': 0.3909, 'learning_rate': 4.187352567573489e-06, 'epoch': 0.71}
 71%|███████ | 2479/3507 [1:01:03<29:55, 1.75s/it] {'loss': 0.1655, 'learning_rate': 4.179838554795515e-06, 'epoch': 0.71}
 71%|███████ | 2480/3507 [1:01:04<24:48, 1.45s/it] {'loss': 0.3523, 'learning_rate': 4.172329508186396e-06, 'epoch': 0.71}
 71%|███████ | 2481/3507 [1:01:05<26:22, 1.54s/it] {'loss': 0.2029, 'learning_rate': 4.164825434153381e-06, 'epoch': 0.71}
 71%|███████ | 2482/3507 [1:01:08<30:17, 1.77s/it] {'loss': 0.135, 'learning_rate': 4.157326339099467e-06, 'epoch': 0.71}
 71%|███████ | 2483/3507 [1:01:08<23:39, 1.39s/it] {'loss': 0.7176, 'learning_rate': 4.149832229423412e-06, 'epoch': 0.71}
 71%|███████ | 2484/3507 [1:01:11<32:43, 1.92s/it] {'loss': 0.2849, 'learning_rate': 4.142343111519712e-06, 'epoch': 0.71}
 71%|███████ | 2485/3507 [1:01:12<25:25, 1.49s/it] {'loss': 0.1904, 'learning_rate': 4.1348589917786105e-06, 'epoch': 0.71}
 71%|███████ | 2486/3507 [1:01:16<36:05, 2.12s/it] {'loss': 0.3412, 'learning_rate': 4.127379876586071e-06, 'epoch': 0.71}
 71%|███████ | 2487/3507 [1:01:16<27:44, 1.63s/it] {'loss': 0.1872, 'learning_rate': 4.119905772323809e-06, 'epoch': 0.71}
 71%|███████ | 2488/3507 [1:01:18<28:45, 1.69s/it] {'loss': 0.4595, 'learning_rate': 4.112436685369248e-06, 'epoch': 0.71}
 71%|███████ | 2489/3507 [1:01:18<23:06, 1.36s/it] {'loss': 0.3407, 'learning_rate': 4.1049726220955365e-06, 'epoch': 0.71}
 71%|███████ | 2490/3507 [1:01:21<27:01, 1.59s/it] {'loss': 0.2527, 'learning_rate': 4.0975135888715316e-06, 'epoch': 0.71}
 71%|███████ | 2491/3507 [1:01:21<21:10, 1.25s/it] {'loss': 0.0825, 'learning_rate': 4.090059592061811e-06, 'epoch': 0.71}
 71%|███████ | 2492/3507 [1:01:25<35:06, 2.08s/it] {'loss': 0.1367, 'learning_rate': 4.0826106380266395e-06, 'epoch': 0.71}
 71%|███████ | 2493/3507 [1:01:25<26:21, 1.56s/it] {'loss': 0.7319, 'learning_rate': 4.075166733121985e-06, 'epoch': 0.71}
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9609, -2.7500, -1.5234, 2.6406, 0.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-5.5312, -4.4062, 0.3496, 2.7188, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:13,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.44 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.7500, -3.8281, -0.0894, 2.1875, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:13,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:46:13,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.11 | bwd_microstep: 1.69 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.37 [2025-11-06 18:46:13,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.58 | bwd: 2.55 | bwd_inner: 1.58 | bwd_allreduce: 0.83 | step: 2.45 71%|███████ | 2494/3507 [1:01:27<26:39, 1.58s/it] {'loss': 0.4003, 'learning_rate': 4.067727883699508e-06, 'epoch': 0.71} 71%|███████ | 2494/3507 [1:01:27<26:39, 1.58s/it]tensor([[-1.8203, 2.4219, 4.0312, -1.6484, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7500, -2.6562, 1.7969, -0.4785, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:13,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.42 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.2656, 1.7891, 3.5938, -1.6875, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5938, -2.7812, 2.8906, 
-0.3809, -5.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9062, -3.6406, -1.8984, 2.2500, -0.4512]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2656, -3.7500, -2.1406, 1.4688, -0.9180]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.4531, 1.1406, 3.6250, -2.5469, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4688, -2.6562, 1.2656, -0.1143, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:46:14,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 18:46:14,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.18 | bwd_microstep: 102.91 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 101.83 | step_microstep: 2.00 [2025-11-06 18:46:14,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.61 | bwd: 103.60 | bwd_inner: 1.57 | bwd_allreduce: 101.88 | step: 2.08 71%|███████ | 2495/3507 [1:01:27<20:54, 1.24s/it] {'loss': 0.1299, 'learning_rate': 4.060294096106561e-06, 'epoch': 0.71} 71%|███████ | 2495/3507 [1:01:27<20:54, 1.24s/it]tensor([[-3.4844, -1.1172, 2.5000, 1.1406, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.3691, 3.5000, 3.1250, -2.5312, -1.9141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.3125, -4.5312, -0.8594, 1.5312, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -5.2188, -1.9922, 2.4062, -1.9609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5000, 0.3477, 3.1562, -1.1875, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') 
tensor([[-3.8281, -4.3125, -1.7344, 2.5938, -1.1641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5938, -4.8125, -0.3418, 2.4531, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:17,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.51 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-6.3750, -6.1875, -1.9922, 1.7891, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:17,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.25 | optimizer_step: 0.23 [2025-11-06 18:46:17,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.36 | bwd_microstep: 2.04 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.90 | step_microstep: 2.29 [2025-11-06 18:46:17,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.87 | bwd: 2.83 | bwd_inner: 1.74 | bwd_allreduce: 0.93 | step: 2.37 71%|███████ | 2496/3507 [1:01:31<30:56, 1.84s/it] {'loss': 0.1407, 'learning_rate': 4.05286537668617e-06, 'epoch': 0.71} 71%|███████ | 2496/3507 [1:01:31<30:56, 1.84s/it]tensor([[-4.0312, -3.8750, -1.1875, 1.7500, -1.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1562, -4.7500, -2.5625, 1.4844, -1.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-4.0625, -4.5938, -2.4688, 1.7109, -1.3516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.5547, -2.5312, -1.0859, 3.4219, 0.7461]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:17,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.00 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | 
bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.7188, -4.2812, -0.5430, 2.3281, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, -4.7500, -0.5625, 2.8594, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.3340, -1.2656, -1.6094, 1.4297, 1.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-6.7188, -4.7812, 0.8984, 1.7422, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:46:17,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:46:17,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.40 | bwd_microstep: 117.65 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 116.62 | step_microstep: 1.69 [2025-11-06 18:46:17,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.43 | bwd: 118.57 | bwd_inner: 1.75 | bwd_allreduce: 116.66 | step: 1.78 71%|███████ | 2497/3507 [1:01:31<24:07, 1.43s/it] {'loss': 1.0542, 'learning_rate': 4.0454417317770334e-06, 'epoch': 0.71} 71%|███████ | 2497/3507 [1:01:31<24:07, 1.43s/it]tensor([[-6.4688, -3.2031, 1.4219, -0.6641, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8750, -2.6250, 0.3945, 1.7344, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.4688, -6.0000, -0.6641, 3.2500, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.9766, -1.2969, 1.4375, 5.8125, 1.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3750, -4.2188, 0.1914, 2.0000, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2812, -1.9844, 1.3047, -1.5781, -4.7188]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1875, -4.4375, -0.7969, 1.7031, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:19,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.22 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.5000, -4.5625, -0.9180, 3.1562, -1.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:19,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:46:19,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.93 | bwd_microstep: 1.84 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.41 [2025-11-06 18:46:19,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.16 | bwd: 2.70 | bwd_inner: 1.71 | bwd_allreduce: 0.87 | step: 2.50 71%|███████ | 2498/3507 [1:01:33<26:13, 1.56s/it] {'loss': 0.4597, 'learning_rate': 4.038023167713522e-06, 'epoch': 0.71} 71%|███████ | 2498/3507 [1:01:33<26:13, 1.56s/it]tensor([[-5.0312, -0.4922, 4.0938, -1.2500, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:19,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.81 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.0469, 1.0547, 3.5156, -1.6328, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5000, -3.1406, 1.9062, -0.3320, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.9375, -3.5312, 2.5938, 0.2148, -5.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.9844, 
-3.9531, -2.4688, 2.0781, -0.3945]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -4.5312, -0.7578, 2.3281, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.2188, -4.0312, 1.8359, 2.4531, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.5000, -5.5000, 0.4590, -0.9492, -6.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:46:20,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:46:20,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.72 | bwd_microstep: 155.03 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 154.03 | step_microstep: 1.57 [2025-11-06 18:46:20,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 277.56 | bwd: 155.95 | bwd_inner: 1.74 | bwd_allreduce: 154.07 | step: 1.64 71%|███████▏ | 2499/3507 [1:01:34<20:41, 1.23s/it] {'loss': 0.139, 'learning_rate': 4.030609690825682e-06, 'epoch': 0.71} 71%|███████▏ | 2499/3507 [1:01:34<20:41, 1.23s/it]tensor([[-3.0312, -4.0312, -3.4062, 0.6016, -0.6055]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-5.0000, -5.4062, -2.3750, 1.8750, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3125, -4.8438, -0.2295, 3.4531, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.3438, 1.7188, 3.6250, 0.9805, -1.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4375, -4.3750, -1.0391, 2.5625, -1.8516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6875, -3.3594, -1.2188, 3.2656, -0.1416]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:3') tensor([[-4.6250, 0.0496, 4.2188, -1.6172, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:22,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.80 | bwd_microstep: 2.71 | bwd_inner_microstep: 2.52 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 tensor([[-4.9688, -2.5156, 2.0156, 1.5938, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:22,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:46:22,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.49 | bwd_microstep: 1.66 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.52 [2025-11-06 18:46:22,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.31 | bwd: 4.37 | bwd_inner: 3.32 | bwd_allreduce: 0.87 | step: 2.65 71%|███████▏ | 2500/3507 [1:01:36<28:04, 1.67s/it] {'loss': 0.2997, 'learning_rate': 4.0232013074392065e-06, 'epoch': 0.71} 71%|███████▏ | 2500/3507 [1:01:36<28:04, 1.67s/it]tensor([[-4.9688, -4.4062, -0.3867, 2.5781, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:23,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.21 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.3750, 0.7422, 2.2031, -1.2812, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.5625, -6.1562, -3.1094, 1.4844, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8438, -2.5625, 2.3750, 0.0214, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5000, -3.2656, 0.8906, 2.3906, -2.5781]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5312, -5.0312, -1.9219, 2.7500, -1.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.7188, -2.3125, 2.9219, 0.4141, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.0625, -4.9688, 0.3223, 1.0000, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:46:23,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:46:23,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.83 | bwd_microstep: 12.81 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 11.97 | step_microstep: 1.75 [2025-11-06 18:46:23,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.06 | bwd: 13.65 | bwd_inner: 1.49 | bwd_allreduce: 12.01 | step: 1.84 71%|███████▏ | 2501/3507 [1:01:37<21:55, 1.31s/it] {'loss': 0.3288, 'learning_rate': 4.0157980238754465e-06, 'epoch': 0.71} 71%|███████▏ | 2501/3507 [1:01:37<21:55, 1.31s/it]tensor([[-2.1406, 1.5312, 2.6875, -1.4922, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.4375, -2.5469, 3.0469, -0.3789, -5.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2812, -3.1562, 0.4160, 1.6641, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0312, -6.6562, -3.6094, 1.2656, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.6641, -2.6562, -2.4531, 1.1172, 0.3301]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5000, -5.0000, 0.5391, 2.3594, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4688, -4.0000, -2.1250, 1.5469, 
-1.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:46:25,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 134.99 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.3750, -2.7969, 2.2500, 1.7031, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:25,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:46:25,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.64 | bwd_microstep: 2.24 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 0.90 | step_microstep: 2.47 [2025-11-06 18:46:25,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.66 | bwd: 2.92 | bwd_inner: 1.82 | bwd_allreduce: 0.93 | step: 2.55 71%|███████▏ | 2502/3507 [1:01:39<28:14, 1.69s/it] {'loss': 0.9255, 'learning_rate': 4.008399846451402e-06, 'epoch': 0.71} 71%|███████▏ | 2502/3507 [1:01:39<28:14, 1.69s/it]tensor([[-3.0469, -0.5508, 1.8750, 0.0845, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.9375, -3.6250, 1.0391, -1.3672, -5.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1562, -2.6875, 1.3984, 0.5703, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5312, -1.8672, 1.0469, 1.6094, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6875, -4.0625, -0.8359, 3.5312, -0.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[4.1562, 5.0938, 6.4062, 7.3438, 4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:26,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 155.60 | bwd_microstep: 4.01 | bwd_inner_microstep: 3.87 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-3.9062, -3.7812, -0.3438, 3.2969, -1.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.4688, 1.5469, 2.5469, -2.5156, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:46:26,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:46:26,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 212.63 | bwd_microstep: 109.11 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 107.98 | step_microstep: 1.86 [2025-11-06 18:46:26,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.25 | bwd: 113.11 | bwd_inner: 4.92 | bwd_allreduce: 108.03 | step: 1.95 71%|███████▏ | 2503/3507 [1:01:40<23:41, 1.42s/it] {'loss': 0.6731, 'learning_rate': 4.001006781479715e-06, 'epoch': 0.71} 71%|███████▏ | 2503/3507 [1:01:40<23:41, 1.42s/it]tensor([[-3.8125, -1.0078, 2.1719, 0.5195, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -2.1094, 2.3906, 0.2656, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.3750, 1.4141, 2.7344, -2.0469, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:46:27,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.29 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-0.3809, 3.5000, 3.0312, -2.7812, -1.9922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4062, -2.5625, 2.8281, -0.7266, -5.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1875, -1.9453, 1.7656, 
1.3516, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8125, -0.8672, 2.4062, 0.1025, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0938, -1.3359, 3.1250, -0.3535, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:46:27,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:46:27,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.70 | bwd_microstep: 30.42 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 29.55 | step_microstep: 1.99 [2025-11-06 18:46:27,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.00 | bwd: 31.35 | bwd_inner: 1.58 | bwd_allreduce: 29.60 | step: 2.09 71%|███████▏ | 2504/3507 [1:01:41<19:23, 1.16s/it] {'loss': 0.6588, 'learning_rate': 3.9936188352686645e-06, 'epoch': 0.71} 71%|███████▏ | 2504/3507 [1:01:41<19:23, 1.16s/it]tensor([[-3.7969, -2.2656, 1.3359, 2.2656, -2.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3750, 1.1875, 2.3750, -4.2188, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8750, -3.5781, 1.7188, 1.6953, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -4.7500, -0.6914, 3.0000, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:29,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.30 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-6.3750, -4.0938, -0.2363, -1.0078, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.2188, -6.3125, -1.7031, 0.9375, -4.3750]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.3438, -4.0312, -2.0156, 2.1719, -0.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6562, -4.3125, 0.4180, 2.3125, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:46:30,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.29 [2025-11-06 18:46:30,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.52 | bwd_microstep: 394.84 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 393.94 | step_microstep: 2.28 [2025-11-06 18:46:30,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.83 | bwd: 395.78 | bwd_inner: 1.61 | bwd_allreduce: 394.00 | step: 2.38 71%|███████▏ | 2505/3507 [1:01:44<29:28, 1.76s/it] {'loss': 0.8307, 'learning_rate': 3.986236014122165e-06, 'epoch': 0.71} 71%|███████▏ | 2505/3507 [1:01:44<29:28, 1.76s/it]tensor([[-5.2188, -2.1250, 1.7422, -0.6250, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9688, -5.6250, -2.0000, 1.2969, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4375, -1.4531, 2.8125, -1.0703, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.4688, -3.9062, 1.9844, 1.6719, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0312, 0.2695, 1.9219, -1.5859, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.8828, 2.5469, 3.4375, -0.7422, -1.7266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4375, -2.1250, 2.7812, 0.1875, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:31,451] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.61 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.13 tensor([[-3.0781, 1.5000, 3.8125, -2.5156, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:31,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:46:31,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.02 | bwd_microstep: 2.01 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.35 [2025-11-06 18:46:31,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 441.58 | bwd: 2.93 | bwd_inner: 1.85 | bwd_allreduce: 0.91 | step: 2.46 71%|███████▏ | 2506/3507 [1:01:45<27:42, 1.66s/it] {'loss': 0.1876, 'learning_rate': 3.978858324339752e-06, 'epoch': 0.71} 71%|███████▏ | 2506/3507 [1:01:45<27:42, 1.66s/it]tensor([[-5.2188, -5.0625, -0.9570, 2.7812, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8281, -0.4941, 1.0859, -3.0000, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-6.5312, -5.5000, -1.2109, 0.8477, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8750, -5.0312, -0.5273, 2.1719, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2500, -4.3125, 0.0610, 0.2910, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0625, -3.5938, 0.4922, 1.8203, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:32,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.61 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.04 | 
step_microstep: 0.19 tensor([[-1.7656, 1.5938, 3.1875, -0.4688, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-3.5938, -0.3281, 2.7500, -0.7500, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:46:33,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:46:33,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.78 | bwd_microstep: 1188.92 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 1187.96 | step_microstep: 2.00 [2025-11-06 18:46:33,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.42 | bwd: 1190.03 | bwd_inner: 1.89 | bwd_allreduce: 1188.00 | step: 2.19 71%|███████▏ | 2507/3507 [1:01:47<29:31, 1.77s/it] {'loss': 1.1246, 'learning_rate': 3.971485772216595e-06, 'epoch': 0.71} 71%|███████▏ | 2507/3507 [1:01:47<29:31, 1.77s/it]tensor([[-3.9688, -0.0679, 3.3750, -0.8750, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:34,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.35 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.9844, -2.0781, 1.1875, 1.2422, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5312, -4.0312, 0.9062, 2.6094, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7500, -1.4297, 3.1406, -1.6562, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.5156, 0.7070, 1.7344, -1.7500, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1250, -3.8125, -0.5938, 0.4941, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-5.5625, -1.9297, 2.9062, -0.5312, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4688, -5.1875, -0.7422, 2.9531, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:46:34,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:46:34,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.86 | bwd_microstep: 147.90 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 146.62 | step_microstep: 1.46 [2025-11-06 18:46:34,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 272.24 | bwd: 148.81 | bwd_inner: 2.02 | bwd_allreduce: 146.66 | step: 1.54 72%|███████▏ | 2508/3507 [1:01:48<22:54, 1.38s/it] {'loss': 0.4212, 'learning_rate': 3.964118364043463e-06, 'epoch': 0.72} 72%|███████▏ | 2508/3507 [1:01:48<22:54, 1.38s/it]tensor([[-5.4375, -2.2500, 2.2812, 0.1045, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3750, -3.1875, 0.0175, 3.3594, -1.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5938, -3.1406, -0.2246, 0.0391, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8203, 0.8516, 1.8984, -0.3496, -1.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4375, -0.1494, 2.2188, -3.0781, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2188, -4.3125, -0.2197, 2.0781, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:35,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.97 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.4688, -2.6094, 
1.1953, 1.5234, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.2344, -3.7188, -1.0234, 3.2188, -0.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:46:36,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.01 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:46:36,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.73 | bwd_microstep: 912.52 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 911.39 | step_microstep: 3.09 [2025-11-06 18:46:36,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.71 | bwd: 913.53 | bwd_inner: 1.94 | bwd_allreduce: 911.44 | step: 3.18 72%|███████▏ | 2509/3507 [1:01:50<27:47, 1.67s/it] {'loss': 0.249, 'learning_rate': 3.956756106106746e-06, 'epoch': 0.72} 72%|███████▏ | 2509/3507 [1:01:50<27:47, 1.67s/it]tensor([[-6.8125, -5.2812, 0.6133, 2.4531, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2812, -3.4062, 1.1953, 1.9609, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.4375, -7.1250, -3.1719, 0.4199, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:36,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.18 | bwd_microstep: 0.64 | bwd_inner_microstep: 0.53 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.2500, -2.6094, 2.1875, -1.0859, -5.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4062, -2.7031, 3.1562, 0.3418, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1250, -4.5312, -0.4688, 2.5156, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4375, -4.0312, 0.6875, 2.0781, -3.3125]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4062, -2.0000, 1.5781, 0.8867, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:46:37,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.27 | optimizer_step: 0.23 [2025-11-06 18:46:37,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.83 | bwd_microstep: 133.29 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 132.38 | step_microstep: 3.09 [2025-11-06 18:46:37,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.04 | bwd: 133.94 | bwd_inner: 1.35 | bwd_allreduce: 132.43 | step: 3.16 72%|███████▏ | 2510/3507 [1:01:51<22:00, 1.32s/it] {'loss': 0.1748, 'learning_rate': 3.949399004688435e-06, 'epoch': 0.72} 72%|███████▏ | 2510/3507 [1:01:51<22:00, 1.32s/it]tensor([[-5.2500, -2.5938, 1.8281, 0.8828, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4375, -3.0312, 1.3516, 0.7578, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -2.5938, 1.2344, 2.3125, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8438, -2.4062, 2.0312, -0.6797, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:37,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.85 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.4062, 0.4277, 2.1094, -2.5469, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0625, -2.5469, -0.7539, 0.8789, -1.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -1.4297, 0.9805, -2.3750, -4.5625]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-7.3750, -5.8125, -2.2500, -1.6484, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:46:40,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.19 | optimizer_step: 0.29 [2025-11-06 18:46:40,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.42 | bwd_microstep: 2425.45 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 2424.12 | step_microstep: 2.24 [2025-11-06 18:46:40,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 419.30 | bwd: 2426.30 | bwd_inner: 1.99 | bwd_allreduce: 2424.17 | step: 2.31 72%|███████▏ | 2511/3507 [1:01:53<29:45, 1.79s/it] {'loss': 1.0374, 'learning_rate': 3.942047066066131e-06, 'epoch': 0.72} 72%|███████▏ | 2511/3507 [1:01:53<29:45, 1.79s/it]tensor([[-5.6875, -4.0312, -0.4922, -0.2891, -3.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -3.5938, 0.0679, 2.6719, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7812, -1.9062, 2.7188, -0.8711, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.6562, -3.5625, 2.1250, 0.6250, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:40,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.90 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.3438, -3.3750, 1.4766, 1.5547, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8125, -3.0000, 2.8906, 1.9062, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0000, -6.2500, -2.4531, 2.0156, -2.7656]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:0') tensor([[-6.9688, -3.8594, 2.2031, 0.6250, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:46:40,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.87 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:46:40,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.76 | bwd_microstep: 23.78 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 22.74 | step_microstep: 2.56 [2025-11-06 18:46:40,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 482.70 | bwd: 24.74 | bwd_inner: 1.82 | bwd_allreduce: 22.78 | step: 2.65 72%|███████▏ | 2512/3507 [1:01:54<23:33, 1.42s/it] {'loss': 0.4996, 'learning_rate': 3.9347002965130165e-06, 'epoch': 0.72} 72%|███████▏ | 2512/3507 [1:01:54<23:33, 1.42s/it]tensor([[-4.7188, -0.8672, 2.1406, -2.1250, -4.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5000, -4.7812, -1.0469, 1.1875, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:40,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.19 | bwd_microstep: 0.64 | bwd_inner_microstep: 0.54 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.9219, -3.0625, 0.1172, 1.7969, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0938, -3.3906, 1.0469, -0.0618, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0000, -3.4219, 0.7656, -0.5078, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.6562, -2.7656, 1.4766, -2.3594, -6.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0625, -3.2188, 0.3965, -1.2109, -4.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:0') tensor([[-0.2910, 0.6289, 1.8906, 3.0000, 0.5430]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:46:42,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 18:46:42,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.75 | bwd_microstep: 1634.77 | bwd_inner_microstep: 9.19 | bwd_allreduce_microstep: 1625.48 | step_microstep: 2.17 [2025-11-06 18:46:42,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.97 | bwd: 1635.41 | bwd_inner: 9.74 | bwd_allreduce: 1625.52 | step: 2.24 72%|███████▏ | 2513/3507 [1:01:56<26:27, 1.60s/it] {'loss': 0.2021, 'learning_rate': 3.9273587022978754e-06, 'epoch': 0.72} 72%|███████▏ | 2513/3507 [1:01:56<26:27, 1.60s/it]tensor([[-6.8438, -3.4531, -0.2490, -3.0469, -5.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -3.7656, 0.2129, 3.1250, -1.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5938, -1.8281, 2.4531, 0.8594, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:42,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.44 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.9219, 2.5781, 3.5938, -2.5938, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1875, -3.2500, -0.0659, 2.1250, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.6875, -4.0938, 0.6172, -0.3848, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-0.7148, 2.1875, 1.4609, -1.8984, -1.4766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') 
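The `[Rank 0] time (ms) | ...` lines repeated throughout this log follow a fixed `key: value` layout, and in the slow steps the backward time is dominated almost entirely by `bwd_allreduce`. A minimal sketch of pulling the per-step timings out of one such line with only the standard library (the line text is copied from the log above; the helper name `parse_timings` is ours, not a DeepSpeed API):

```python
import re

def parse_timings(line: str) -> dict:
    """Extract 'name: value' millisecond pairs from a DeepSpeed log_dist timing line."""
    return {k: float(v) for k, v in re.findall(r"(\w+): ([\d.]+)", line)}

# Copied from the 18:46:36,703 step above, where bwd_allreduce dwarfs bwd_inner
line = ("[Rank 0] time (ms) | fwd_microstep: 201.73 | bwd_microstep: 912.52 | "
        "bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 911.39 | step_microstep: 3.09")
t = parse_timings(line)
# Fraction of backward time spent in the gradient all-reduce
print(t["bwd_allreduce_microstep"] / t["bwd_microstep"])  # ≈ 0.999
```

That ratio is consistent across the slow iterations in this log (e.g. the 2425 ms and 2418 ms backward steps), which suggests the step-time spikes come from communication, not compute.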
tensor([[-4.3438, -0.5469, 2.7344, -1.0703, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:46:43,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:46:43,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 101.60 | bwd_microstep: 130.30 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 129.13 | step_microstep: 1.43 [2025-11-06 18:46:43,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.05 | bwd: 131.18 | bwd_inner: 1.89 | bwd_allreduce: 129.16 | step: 1.50 72%|███████▏ | 2514/3507 [1:01:56<20:44, 1.25s/it] {'loss': 0.2048, 'learning_rate': 3.920022289685057e-06, 'epoch': 0.72} 72%|███████▏ | 2514/3507 [1:01:56<20:44, 1.25s/it]tensor([[-0.8164, 0.1670, 2.0781, 2.9844, 0.0500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1562, -2.3438, 0.4121, 0.1279, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6094, -4.0938, -0.9180, 3.5781, -0.8789]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.3438, -4.7188, 1.4375, 1.0234, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.9688, -6.2812, 0.0977, 2.0156, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:43,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.43 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-6.5312, -4.0938, 0.2207, -0.5742, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0938, -4.3125, 0.5117, 1.1875, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1562, -4.1250, 
-0.0247, 2.1562, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:46:46,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.30 | optimizer_step: 0.24 [2025-11-06 18:46:46,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.29 | bwd_microstep: 2418.62 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 2417.74 | step_microstep: 2.39 [2025-11-06 18:46:46,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 516.75 | bwd: 2419.33 | bwd_inner: 1.37 | bwd_allreduce: 2417.81 | step: 2.49 72%|███████▏ | 2515/3507 [1:01:59<29:17, 1.77s/it] {'loss': 0.6029, 'learning_rate': 3.912691064934513e-06, 'epoch': 0.72} 72%|███████▏ | 2515/3507 [1:01:59<29:17, 1.77s/it]tensor([[-4.5312, -4.9688, -2.0156, 2.2031, -1.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:46,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.12 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.5625, -4.8125, -1.7344, 2.2812, -1.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8281, -4.1875, -0.9102, 3.4219, -1.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -3.5469, 0.1289, 3.0000, -1.8203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.7812, -5.7812, -1.5234, 0.5000, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.2500, -5.2812, -1.6641, 2.4844, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0625, -5.1250, -1.6562, 2.3906, -2.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1250, -2.6719, 1.8984, 1.1875, 
-3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:46:46,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.16 | optimizer_step: 0.15 [2025-11-06 18:46:46,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.94 | bwd_microstep: 205.04 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 204.05 | step_microstep: 1.58 [2025-11-06 18:46:46,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 288.08 | bwd: 205.95 | bwd_inner: 1.71 | bwd_allreduce: 204.10 | step: 1.66 72%|███████▏ | 2516/3507 [1:02:00<23:06, 1.40s/it] {'loss': 0.1833, 'learning_rate': 3.905365034301754e-06, 'epoch': 0.72} 72%|███████▏ | 2516/3507 [1:02:00<23:06, 1.40s/it]tensor([[-0.9805, 2.1719, 2.2031, -2.0156, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.3438, -4.4375, -0.6797, 3.5000, -1.5391]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1875, -1.9141, 2.2031, 1.7344, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:46,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.44 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.5312, -2.3594, 1.4844, 1.2422, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [h264 @ 0x92192c0] mmco: unref short failure tensor([[-3.8438, -4.6562, -2.1406, 2.7812, -0.9336]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2188e+00, -4.2188e+00, 5.0964e-03, 2.4688e+00, -2.8438e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.3438, -2.7500, 2.7812, -0.2041, -5.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') 
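Each interleaved `tensor([[...]])` / `tensor([n])` pair in this log appears to be a row of 5-way logits followed by a single class index (the `grad_fn=<...>` values were stripped somewhere in capture). On some rows the index matches the logit argmax and on others it does not, which is consistent with the second tensor being the target label rather than the prediction. A plain-Python sketch of that comparison, using one row copied from the log (this reading of the prints is an assumption):

```python
# One logits row and its accompanying index, copied from the log above
logits = [-5.2188, -4.3125, -0.2197, 2.0781, -2.9062]
label = 3  # the tensor([3]) printed next to it

# argmax over the 5 class scores, without torch
pred = max(range(len(logits)), key=lambda i: logits[i])
print(pred == label)  # True for this row; other rows in the log disagree,
                      # consistent with the second tensor being the target
```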
tensor([[-5.2188, -1.6250, 2.0625, -1.4297, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:46:47,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.21 | optimizer_step: 0.19 [2025-11-06 18:46:47,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.68 | bwd_microstep: 556.89 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 555.88 | step_microstep: 1.99 [2025-11-06 18:46:47,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 522.16 | bwd: 557.76 | bwd_inner: 1.70 | bwd_allreduce: 555.91 | step: 2.06 72%|███████▏ | 2517/3507 [1:02:01<21:45, 1.32s/it] {'loss': 0.262, 'learning_rate': 3.898044204037861e-06, 'epoch': 0.72} 72%|███████▏ | 2517/3507 [1:02:01<21:45, 1.32s/it]tensor([[-4.1875, -3.2656, -0.0170, 1.9141, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.4609, 2.9844, 3.9375, -2.4219, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.4375, -2.5781, 1.8125, 2.2344, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:47,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.73 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10 [h264 @ 0xc2549c0] mmco: unref short failure tensor([[-0.5898, 2.9688, 2.6094, -2.3594, -1.8828]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.9375, -4.5938, 0.4043, 0.2285, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5938, -1.9453, 1.3672, -0.1963, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2812, -4.5625, -1.8281, 1.8516, -1.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:0') tensor([[-4.8750, -4.1875, -0.1270, 2.9062, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:46:48,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.23 | optimizer_step: 0.24 [2025-11-06 18:46:48,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.89 | bwd_microstep: 27.40 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 26.34 | step_microstep: 2.15 [2025-11-06 18:46:48,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.65 | bwd: 28.36 | bwd_inner: 1.83 | bwd_allreduce: 26.38 | step: 2.25 72%|███████▏ | 2518/3507 [1:02:02<17:28, 1.06s/it] {'loss': 0.5067, 'learning_rate': 3.890728580389478e-06, 'epoch': 0.72} 72%|███████▏ | 2518/3507 [1:02:02<17:28, 1.06s/it]tensor([[-6.3125, -5.2812, -0.4238, 1.8516, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2656, 1.4844, 2.0312, -2.8438, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8750, -0.3438, 2.8438, -1.0078, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6562, -1.1172, 2.4531, -0.9180, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3125, -2.6875, 1.9844, 0.9844, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1250, -3.8906, 0.5039, 2.5312, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:48,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.54 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-6.4062, -5.1562, -0.8164, 1.1094, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') 
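The tqdm-style progress lines (`2508/3507 [1:01:48<22:54, 1.38s/it]`) carry an ETA that is roughly remaining iterations times the smoothed seconds-per-iteration. A back-of-envelope check against that line (tqdm smooths the rate over recent iterations, so the match is approximate, not exact):

```python
# Values copied from the '2508/3507 [1:01:48<22:54, 1.38s/it]' progress line
total, done, sec_per_it = 3507, 2508, 1.38

eta_s = (total - done) * sec_per_it          # naive ETA in seconds
mins, secs = divmod(int(eta_s), 60)
print(f"{mins}:{secs:02d}")  # 22:58 — close to the 22:54 shown in the log
```

The gap between 22:58 and 22:54 reflects tqdm's exponential smoothing of the displayed rate; the large step-to-step variance visible here (1.06 s/it to 1.91 s/it) makes any single-rate ETA rough.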
tensor([[-6.1250, -5.0938, -0.7812, 1.4922, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:46:50,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.22 | optimizer_step: 0.31 [2025-11-06 18:46:50,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.57 | bwd_microstep: 1274.68 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1273.48 | step_microstep: 2.40 [2025-11-06 18:46:50,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 345.13 | bwd: 1275.61 | bwd_inner: 1.90 | bwd_allreduce: 1273.53 | step: 2.48 72%|███████▏ | 2519/3507 [1:02:04<23:19, 1.42s/it] {'loss': 0.4284, 'learning_rate': 3.883418169598808e-06, 'epoch': 0.72} 72%|███████▏ | 2519/3507 [1:02:04<23:19, 1.42s/it]tensor([[-1.6328, 2.5625, 2.9688, -3.1562, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:50,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.56 | bwd_microstep: 1.16 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.2500, -2.9844, 2.5156, 0.6211, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7969, 0.8125, 4.1250, -2.0625, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2812, -4.4375, 0.6133, 3.5469, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -2.3750, 1.5938, 0.9531, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9375, -4.0000, 0.6680, 3.2656, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -2.1094, 1.5391, 1.5547, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3438, 0.5938, 
3.0312, -1.7188, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:51,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:46:51,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.54 | bwd_microstep: 1.89 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.96 [2025-11-06 18:46:51,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.11 | bwd: 3.04 | bwd_inner: 2.03 | bwd_allreduce: 0.88 | step: 2.04 72%|███████▏ | 2520/3507 [1:02:05<20:05, 1.22s/it] {'loss': 0.4972, 'learning_rate': 3.8761129779036054e-06, 'epoch': 0.72} 72%|███████▏ | 2520/3507 [1:02:05<20:05, 1.22s/it]tensor([[-6.8750, -3.3906, 2.2969, 0.3301, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0625, -2.1875, 1.6875, -0.2012, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.9375, -2.7031, 2.5938, -1.3125, -6.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.3125, -5.2500, -0.7266, 1.2812, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6562, 0.7891, 2.3906, -1.9062, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -3.7969, 0.2871, 2.0156, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:52,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.40 | bwd_microstep: 1.12 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.17 tensor([[-5.2812, -4.7812, -0.9727, 2.0625, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.4375, -1.3516, 2.4375, -2.2969, -5.4062]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:46:54,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:46:54,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.60 | bwd_microstep: 2027.03 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 2025.85 | step_microstep: 2.24 [2025-11-06 18:46:54,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.00 | bwd: 2028.14 | bwd_inner: 2.14 | bwd_allreduce: 2025.88 | step: 2.42 72%|███████▏ | 2521/3507 [1:02:08<29:14, 1.78s/it] {'loss': 0.3598, 'learning_rate': 3.868813011537169e-06, 'epoch': 0.72} 72%|███████▏ | 2521/3507 [1:02:08<29:14, 1.78s/it]tensor([[-0.2119, 3.0469, 2.1250, -2.0469, -1.3516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.4062, -4.2500, -0.5625, 2.9219, -1.8672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -4.7500, -1.5703, 2.2969, -1.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5625, -3.6875, 0.1631, 2.3750, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:54,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.85 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.5469, 1.7969, 3.4531, -2.3281, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.9688, -5.3438, -0.3613, 0.9141, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1875, -4.4688, -0.0332, 2.9062, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -3.6250, -0.3809, 2.2969, -1.9062]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:54,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:46:54,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.00 | bwd_microstep: 2.08 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.92 | step_microstep: 1.77 [2025-11-06 18:46:54,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.88 | bwd: 3.02 | bwd_inner: 1.92 | bwd_allreduce: 0.96 | step: 1.86 72%|███████▏ | 2522/3507 [1:02:08<22:22, 1.36s/it] {'loss': 0.7376, 'learning_rate': 3.861518276728341e-06, 'epoch': 0.72} 72%|███████▏ | 2522/3507 [1:02:08<22:22, 1.36s/it]tensor([[-3.2969, -4.0938, -1.7344, 2.8594, -0.5898]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4688, -3.0781, 0.5977, 3.7812, -1.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:54,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.56 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.5312, -1.8984, 1.7734, 0.1807, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.6250, -1.3594, 3.1250, -2.0000, -5.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7812, -2.6406, 2.4844, 0.4473, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9375, -3.5000, 1.0469, 2.5312, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -1.5625, 2.0625, -0.0581, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.3750, -6.7500, -1.1641, 2.7031, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') [2025-11-06 18:46:57,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.74 | optimizer_gradients: 0.24 | optimizer_step: 0.22 [2025-11-06 18:46:57,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.25 | bwd_microstep: 2785.58 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 2784.44 | step_microstep: 3.28 [2025-11-06 18:46:57,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.84 | bwd: 2786.49 | bwd_inner: 1.85 | bwd_allreduce: 2784.49 | step: 3.38 72%|███████▏ | 2523/3507 [1:02:11<31:22, 1.91s/it] {'loss': 0.7484, 'learning_rate': 3.854228779701498e-06, 'epoch': 0.72} 72%|███████▏ | 2523/3507 [1:02:11<31:22, 1.91s/it]tensor([[-6.7812, -4.6875, 0.5977, 0.8125, -4.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.7188, -4.3125, 1.0078, 0.7656, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9688, -0.7383, 1.6094, -1.3047, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:46:58,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.40 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.3594, 0.5469, 2.0312, -3.0156, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.3750, -3.1250, 1.9531, -0.1094, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2500, -3.6875, -1.1719, 2.8906, -0.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-6.0312, -6.0000, -1.9688, 2.2031, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8125, -2.7188, 1.9062, -0.3145, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:2') [2025-11-06 18:46:58,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:46:58,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.72 | bwd_microstep: 150.12 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 149.19 | step_microstep: 1.88 [2025-11-06 18:46:58,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.16 | bwd: 150.98 | bwd_inner: 1.59 | bwd_allreduce: 149.24 | step: 1.97 72%|███████▏ | 2524/3507 [1:02:12<24:29, 1.50s/it] {'loss': 0.9729, 'learning_rate': 3.846944526676556e-06, 'epoch': 0.72} 72%|███████▏ | 2524/3507 [1:02:12<24:29, 1.50s/it]tensor([[-5.9375, -5.7812, -1.8594, 1.8906, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:46:58,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.08 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.0000, -1.9219, 2.2188, 0.3535, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.7188, -3.4844, 2.4688, 0.7539, -5.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4219, -0.0425, 2.2812, -1.1328, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3125, -3.2656, 1.4141, 1.5547, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.7812, -1.0391, 3.0312, -0.6875, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.6250, -3.0000, 0.9258, 1.6797, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -3.2188, 0.8438, 1.5938, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 
[2025-11-06 18:46:59,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.19 | optimizer_step: 0.21
[2025-11-06 18:46:59,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.77 | bwd_microstep: 513.01 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 512.09 | step_microstep: 1.95
[2025-11-06 18:46:59,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 251.86 | bwd: 513.96 | bwd_inner: 1.65 | bwd_allreduce: 512.14 | step: 2.05
72%|███████▏ | 2525/3507 [1:02:13<21:03, 1.29s/it] {'loss': 0.8678, 'learning_rate': 3.839665523868942e-06, 'epoch': 0.72}
tensor([[-7.5000, -4.4062, 0.8906, -0.4141, -5.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:46:59,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.73 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10
tensor([[-6.3125, -4.1562, 0.7070, 0.9141, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.4844, 0.5078, 2.5938, -2.6406, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.6875, -3.5312, -0.5586, 2.5781, -1.4297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.8438, -2.8594, 2.0156, 0.1484, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.4844, -2.5938, 0.1953, 1.5625, -1.8828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.8906, -0.1377, 2.1406, -2.0938, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.4375, 0.3789, 2.7656, 0.1602, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
[2025-11-06 18:47:00,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.27
[2025-11-06 18:47:00,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.46 | bwd_microstep: 754.89 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 753.91 | step_microstep: 2.05
[2025-11-06 18:47:00,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.21 | bwd: 755.90 | bwd_inner: 1.74 | bwd_allreduce: 753.98 | step: 2.15
72%|███████▏ | 2526/3507 [1:02:14<20:09, 1.23s/it] {'loss': 0.5405, 'learning_rate': 3.832391777489607e-06, 'epoch': 0.72}
tensor([[-4.3750, -1.0078, 2.3750, -0.5938, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-1.2969, -0.9336, 1.6562, 4.3125, 0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:47:00,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.19 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.6562, -4.0938, 0.5000, 1.5469, -3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-7.4688, -4.2812, 0.7930, -1.0703, -6.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.0000, -3.1875, 2.2344, 0.8750, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.1562, -4.7500, 0.0635, 1.8750, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-1.4922, 1.9453, 2.5938, -1.9297, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.9688, -4.2188, 1.8594, 1.2812, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 18:47:00,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.24 | optimizer_step: 0.21
[2025-11-06 18:47:00,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.71 | bwd_microstep: 243.97 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 242.86 | step_microstep: 2.30
[2025-11-06 18:47:00,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.93 | bwd: 244.81 | bwd_inner: 1.78 | bwd_allreduce: 242.90 | step: 2.38
72%|███████▏ | 2527/3507 [1:02:14<17:23, 1.06s/it] {'loss': 0.8692, 'learning_rate': 3.82512329374503e-06, 'epoch': 0.72}
tensor([[-3.1250, -2.8594, 0.1992, 3.0781, -1.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-1.9766, 1.8203, 4.3438, -0.1069, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.2031, 1.6172, 3.4375, -1.7266, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:47:02,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.44 | bwd_microstep: 1.46 | bwd_inner_microstep: 1.32 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-6.1875, -4.7812, 0.1885, 1.9375, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-1.3828, -0.6445, 2.8906, 5.3438, 0.2793]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.9531, -0.5078, 2.8281, -0.3750, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-2.9219, 0.9375, 3.2344, -1.8438, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.7188, -3.8594, 0.1445, 2.4062, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:47:02,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.20 | optimizer_step: 0.20
[2025-11-06 18:47:02,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.58 | bwd_microstep: 348.34 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 347.24 | step_microstep: 1.96
[2025-11-06 18:47:02,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 424.04 | bwd: 349.80 | bwd_inner: 2.34 | bwd_allreduce: 347.30 | step: 2.07
72%|███████▏ | 2528/3507 [1:02:16<20:07, 1.23s/it] {'loss': 0.1085, 'learning_rate': 3.817860078837186e-06, 'epoch': 0.72}
tensor([[-1.8359, 1.4844, 2.9531, -1.0391, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.9688, -0.9688, 2.0781, -2.5469, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.9062, -3.2812, 1.6797, 0.7578, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.9688, -3.4531, 1.1562, 2.2500, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.6875, -2.4062, 0.2393, 0.9609, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.4688, -3.6875, 1.1328, 1.9922, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:47:03,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 114.09 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-6.0938, -4.2188, 1.4766, 2.3594, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.5625, -5.5312, -1.5469, 2.5781, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:47:03,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 18:47:03,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.18 | bwd_microstep: 1.85 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.97
[2025-11-06 18:47:03,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.29 | bwd: 2.76 | bwd_inner: 1.75 | bwd_allreduce: 0.88 | step: 2.06
72%|███████▏ | 2529/3507 [1:02:17<17:22, 1.07s/it] {'loss': 0.3708, 'learning_rate': 3.8106021389635583e-06, 'epoch': 0.72}
tensor([[-7.1875, -7.4375, -3.7500, 0.6367, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.8594, -3.5625, 0.5156, 4.3438, -1.2734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.9375, -2.7812, 2.5156, 0.3984, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:47:04,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.37 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-1.7734, 0.9414, 1.8828, -0.5430, -1.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
tensor([[-7.7812, -4.9375, 1.2188, 0.3848, -5.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.1562, -4.2188, -1.1328, 2.3906, -1.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.8750, -2.2500, 2.0781, 1.0703, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.3125, -2.9219, 0.3398, 1.6328, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:47:06,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.14 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:47:06,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.82 | bwd_microstep: 1225.01 | bwd_inner_microstep: 1.45 | bwd_allreduce_microstep: 1223.47 | step_microstep: 2.86
[2025-11-06 18:47:06,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 404.22 | bwd: 1225.76 | bwd_inner: 2.08 | bwd_allreduce: 1223.53 | step: 2.95
72%|███████▏ | 2530/3507 [1:02:19<25:39, 1.58s/it] {'loss': 0.7071, 'learning_rate': 3.8033494803171224e-06, 'epoch': 0.72}
tensor([[-4.1250, -4.5000, -1.3672, 2.9844, -1.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:47:06,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.47 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-14.2500, -12.3125, -6.3125, -4.9062, -10.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.8750, -3.4219, 0.5742, 1.8594, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.3125, -1.3828, 2.9844, -0.6016, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.0000, -4.4062, -1.4688, 2.8125, -1.2422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.8125, -3.6250, 0.7734, 0.6641, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.0312, -0.3027, 2.1562, -2.1719, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.6094, -0.3008, 3.3438, 2.4844, -1.8359]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:47:06,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.83 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:47:06,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 108.02 | bwd_microstep: 164.63 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 163.80 | step_microstep: 2.41
[2025-11-06 18:47:06,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 247.52 | bwd: 165.33 | bwd_inner: 1.34 | bwd_allreduce: 163.84 | step: 2.49
72%|███████▏ | 2531/3507 [1:02:20<20:07, 1.24s/it] {'loss': 0.657, 'learning_rate': 3.7961021090863625e-06, 'epoch': 0.72}
tensor([[-4.3438, -1.9688, 1.4375, 0.1602, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.4375, -4.4062, -0.3086, 1.9141, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.3750, -4.3438, -1.2109, 2.3594, -1.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.2812, -3.9375, 2.2812, 2.7188, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:47:09,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.20 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.0312, -2.0469, 1.9453, -0.3809, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-6.5938, -3.4219, 2.5469, 0.8281, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.3438, -4.1250, -0.9414, 2.2344, -1.9141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.1250, -4.0312, 0.3164, 2.5000, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:47:10,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.19
[2025-11-06 18:47:10,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.75 | bwd_microstep: 1047.64 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 1046.62 | step_microstep: 2.18
[2025-11-06 18:47:10,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 498.96 | bwd: 1048.58 | bwd_inner: 1.76 | bwd_allreduce: 1046.67 | step: 2.27
72%|███████▏ | 2532/3507 [1:02:24<34:26, 2.12s/it] {'loss': 0.1733, 'learning_rate': 3.78886003145524e-06, 'epoch': 0.72}
tensor([[-4.8438, -3.4375, 0.5664, 1.7969, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.0312, -5.7500, -2.5000, 2.5781, -1.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:47:10,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.12 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.6250, -1.0859, 1.1719, -0.4746, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-7.0938, -5.3125, 0.5391, 1.9531, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-7.1562, -4.2812, 1.3906, 0.4941, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.4688, -2.9844, 0.6992, 1.9844, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-3.6094, -4.1562, -1.3672, 3.0938, -0.8711]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-0.4961, 2.8906, 2.3125, -2.0781, -1.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
[2025-11-06 18:47:11,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.22 | optimizer_step: 0.19
[2025-11-06 18:47:11,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.73 | bwd_microstep: 72.72 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 71.62 | step_microstep: 6.10
[2025-11-06 18:47:11,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.89 | bwd: 73.63 | bwd_inner: 1.71 | bwd_allreduce: 71.70 | step: 6.19
72%|███████▏ | 2533/3507 [1:02:25<26:40, 1.64s/it] {'loss': 0.5104, 'learning_rate': 3.7816232536032017e-06, 'epoch': 0.72}
tensor([[-5.1250, -5.1875, -1.1875, 3.0938, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.0000, -3.6719, 1.6328, 1.6094, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.1250, -4.6562, -0.0118, 1.2188, -3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:47:12,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.74 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.0000, -3.6406, 0.3418, 1.6641, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-2.6562, 0.9336, 2.7656, -1.2812, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.3281, -3.5312, -0.6953, 2.8125, -1.0859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.8125, -3.0000, 1.8047, 2.5000, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.0312, -3.4062, 1.4766, 0.5234, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:47:14,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.29
[2025-11-06 18:47:14,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.01 | bwd_microstep: 1275.31 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1274.00 | step_microstep: 2.06
[2025-11-06 18:47:14,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 419.78 | bwd: 1276.24 | bwd_inner: 2.02 | bwd_allreduce: 1274.05 | step: 2.15
72%|███████▏ | 2534/3507 [1:02:27<32:35, 2.01s/it] {'loss': 0.3636, 'learning_rate': 3.7743917817051723e-06, 'epoch': 0.72}
tensor([[-3.1250, -3.3750, -1.0781, 2.5000, -0.8516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.0312, -0.7266, 2.7188, -0.3809, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.3281, -3.0938, -0.4883, 2.4531, -1.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.2344, 0.0398, 2.3438, -0.8516, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.4688, 0.2383, 3.7031, -2.1562, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:47:14,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 331.05 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.2500, -1.7656, 2.4531, 1.7500, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.5625, 1.0859, 3.0156, -0.9258, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-3.9844, -3.6406, 0.0942, 3.3594, -1.5547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:47:14,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.72 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:47:14,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.91 | bwd_microstep: 2.10 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.44
[2025-11-06 18:47:14,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 473.98 | bwd: 3.05 | bwd_inner: 2.02 | bwd_allreduce: 0.90 | step: 2.52
72%|███████▏ | 2535/3507 [1:02:28<25:18, 1.56s/it] {'loss': 0.4569, 'learning_rate': 3.76716562193155e-06, 'epoch': 0.72}
tensor([[-3.5156, 0.2539, 3.0312, -1.1406, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.6250, -2.4844, 1.3750, 0.6602, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.8750, -2.5312, 1.5781, 0.9414, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:47:15,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.71 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[3.2031, 3.8438, 5.8750, 7.3125, 3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.0312, -4.5938, -0.8594, 2.0938, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-6.4688, -5.0000, 0.5703, 2.4062, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-7.1250, -5.1250, -0.2891, 0.3320, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.1250, -4.1875, -0.7852, 3.2344, -1.3984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:47:17,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.15 | optimizer_step: 0.19
[2025-11-06 18:47:17,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.57 | bwd_microstep: 1250.68 | bwd_inner_microstep: 1.29 | bwd_allreduce_microstep: 1249.29 | step_microstep: 1.57
[2025-11-06 18:47:17,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 400.32 | bwd: 1251.66 | bwd_inner: 2.17 | bwd_allreduce: 1249.33 | step: 1.67
72%|███████▏ | 2536/3507 [1:02:31<31:00, 1.92s/it] {'loss': 0.3151, 'learning_rate': 3.759944780448199e-06, 'epoch': 0.72}
tensor([[-5.1250, -4.5938, -1.2109, 1.6484, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.2188, -2.1562, 1.7969, -0.1138, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.3438, -3.2031, 1.4766, 3.8438, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:47:17,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.96 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-7.3438, -5.5625, 0.0962, 1.3438, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-7.4688, -6.9375, -1.7812, 1.8516, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.2188, -1.6797, 3.1406, -2.1875, -6.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.6562, -3.8438, -0.2637, 0.0850, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.3281, -0.9062, 1.5547, 0.3281, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:47:17,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:47:17,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.53 | bwd_microstep: 36.80 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 35.79 | step_microstep: 2.26
[2025-11-06 18:47:17,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.51 | bwd: 37.51 | bwd_inner: 1.54 | bwd_allreduce: 35.83 | step: 2.35
72%|███████▏ | 2537/3507 [1:02:31<23:52, 1.48s/it] {'loss': 0.3425, 'learning_rate': 3.7527292634164468e-06, 'epoch': 0.72}
tensor([[-5.0312, -0.4863, 3.8125, -1.2812, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.8438, -2.1562, 0.8672, -0.8906, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.8281, -0.2354, 2.9531, -1.0547, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:47:18,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.19 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.1875, -4.9062, -0.1367, 2.0938, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-7.8438, -6.0312, -1.2891, -0.5195, -5.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.6562, -3.9375, 0.4277, 1.1875, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.2656, -3.0781, -2.3281, 1.1094, -0.1602]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:2')
tensor([[-8.0000, -6.3750, -1.0938, 0.6172, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 18:47:19,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.38 | optimizer_step: 0.30
[2025-11-06 18:47:19,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.57 | bwd_microstep: 1100.56 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1099.36 | step_microstep: 3.26
[2025-11-06 18:47:19,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.79 | bwd: 1101.43 | bwd_inner: 1.83 | bwd_allreduce: 1099.43 | step: 3.35
72%|███████▏ | 2538/3507 [1:02:33<24:09, 1.50s/it] {'loss': 0.8592, 'learning_rate': 3.745519076993078e-06, 'epoch': 0.72}
tensor([[-3.2344, 0.6250, 2.9062, -1.7891, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.4688, -2.3906, 3.0625, -0.9688, -5.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.5625, -0.9844, 3.1094, -0.0186, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:47:19,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.24 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.3125, -4.3438, 0.4492, 0.9180, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-7.1875, -5.8438, -0.0420, 2.5312, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.5000, -4.6875, -0.2051, 2.5781, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.7188, -4.5938, -0.6641, 0.8125, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-6.1562, -3.9688, 0.2949, 0.0801, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:47:19,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:47:19,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.03 | bwd_microstep: 159.90 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 159.10 | step_microstep: 2.09
[2025-11-06 18:47:19,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.30 | bwd: 160.69 | bwd_inner: 1.39 | bwd_allreduce: 159.15 | step: 2.18
72%|███████▏ | 2539/3507 [1:02:33<19:45, 1.22s/it] {'loss': 0.612, 'learning_rate': 3.738314227330324e-06, 'epoch': 0.72}
tensor([[-4.6875, -4.6562, -0.6328, 3.5000, -1.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.6562, 1.8750, 3.7656, -2.6406, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
tensor([[-5.5938, -2.5781, 2.7344, 1.1719, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-0.8203, 2.7031, 2.8750, -1.8359, -1.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:47:21,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 62.47 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.8438, -2.1094, 3.0781, -0.2539, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-8.5625, -7.1562, -1.1562, 1.1641, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.9062, -4.1250, -0.5703, 1.5859, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.8750, -3.8438, 0.6953, 1.3125, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 18:47:23,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.19 | optimizer_step: 0.21
[2025-11-06 18:47:23,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 303.50 | bwd_microstep: 1555.01 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1553.86 | step_microstep: 1.94
[2025-11-06 18:47:23,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.98 | bwd: 1556.11 | bwd_inner: 2.06 | bwd_allreduce: 1553.92 | step: 2.02
72%|███████▏ | 2540/3507 [1:02:36<29:12, 1.81s/it] {'loss': 0.4476, 'learning_rate': 3.7311147205758767e-06, 'epoch': 0.72}
tensor([[-6.9688, -4.4688, 1.3594, 1.2266, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.9375, -6.6250, -2.1875, 1.4844, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.4844, 0.3574, 1.8750, -2.9375, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.7188, -4.1875, -0.3301, 2.1719, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:47:23,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.85 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-4.2500, -0.3281, 3.2812, -0.9609, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.0938, -3.2031, 1.0625, 1.0703, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-7.3750, -4.6875, 0.2285, -0.9297, -5.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.9531, -0.9766, 2.6094, 0.8164, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 18:47:23,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.19 | optimizer_step: 0.24
[2025-11-06 18:47:23,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.40 | bwd_microstep: 27.01 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 25.71 | step_microstep: 2.00
[2025-11-06 18:47:23,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.28 | bwd: 27.96 | bwd_inner: 2.01 | bwd_allreduce: 25.77 | step: 2.11
72%|███████▏ | 2541/3507 [1:02:37<22:29, 1.40s/it] {'loss': 0.4324, 'learning_rate': 3.7239205628728483e-06, 'epoch': 0.72}
tensor([[-4.7188, -3.0469, 0.5586, 1.3438, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.0625, -6.5938, -3.9531, 0.5781, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:0')
[2025-11-06 18:47:23,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.39 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.3594, -3.0156, -1.9062, 1.6328, -0.2275]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.6562, -6.0312, -2.4062, 2.0938, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-8.1875, -6.1250, -0.0086, 0.9492, -5.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-2.4688, -3.1250, -0.5117, 4.2188, 0.0608]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.5938, 1.9922, 3.3906, -2.8750, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.1562, -4.3125, -1.3984, 2.4219, -1.6016]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:47:25,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 18:47:25,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.22 | bwd_microstep: 1728.55 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1727.35 | step_microstep: 2.13
[2025-11-06 18:47:25,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.60 | bwd: 1729.55 | bwd_inner: 2.01 | bwd_allreduce: 1727.40 | step: 2.22
72%|███████▏ | 2542/3507 [1:02:39<26:07, 1.62s/it] {'loss': 0.8044, 'learning_rate': 3.7167317603597975e-06, 'epoch': 0.72}
tensor([[-5.4062, -4.8438, -0.3574, 2.6875, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.7188, -2.2656, 1.5938, 0.4219, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.6875, 0.2305, 3.7188, -2.7500, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:47:25,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.20 | bwd_microstep: 5.62 | bwd_inner_microstep: 5.50 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.0312, -1.1094, 2.2500, 0.1377, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-1.0156, 1.7891, 2.0625, -1.2188, -1.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.4688, -3.8438, 1.1328, 2.4219, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.5938, -4.0312, -0.2002, 2.5156, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.8438, -1.5391, 2.3281, -0.3105, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:47:26,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:47:26,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.61 | bwd_microstep: 1.86 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.90 | step_microstep: 2.10
[2025-11-06 18:47:26,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.83 | bwd: 7.49 | bwd_inner: 6.40 | bwd_allreduce: 0.94 | step: 2.19
73%|███████▎ | 2543/3507 [1:02:39<20:22, 1.27s/it] {'loss': 0.5255, 'learning_rate': 3.7095483191707206e-06, 'epoch': 0.73}
tensor([[-3.6250, -0.2871, 2.5938, -0.4648, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:47:26,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.88 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.0000, -1.8516, 1.6016, 0.8242, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-7.5000, -5.1875, 1.3438, 1.8750, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.3438, -2.0938, 2.0312, -0.4531, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.8125, -4.1875, 0.6680, 1.8672, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.3750, -0.3242, 0.6797, -4.2500, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-4.6250, -1.6719, 1.8516, -0.8477, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.8750, -4.0938, -0.4316, 1.9297, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 18:47:29,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.31 | optimizer_step: 0.28
[2025-11-06 18:47:29,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 237.77 | bwd_microstep: 3161.74 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 3160.70 | step_microstep: 3.08
[2025-11-06 18:47:29,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 404.69 | bwd: 3162.79 | bwd_inner: 1.88 | bwd_allreduce: 3160.75 | step: 3.18
73%|███████▎ | 2544/3507 [1:02:43<31:39, 1.97s/it] {'loss': 1.1665, 'learning_rate': 3.7023702454350284e-06, 'epoch': 0.73}
tensor([[-4.9062, -0.9766, 2.7031, -1.3047, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.7031, -1.9453, 1.2578, 1.3672, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.7344, -3.0156, 1.0000, 3.5469, -1.6016]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.1250, -5.6250, -0.7070, 2.7969, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:47:29,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.79 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.4062, -3.6562, 0.7891, 1.6953, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.4062, -4.1875, 1.1406, 1.0703, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.6562, -4.0312, 0.5234, 1.8672, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.9375, -2.7344, 2.5938, 0.5977, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 18:47:30,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:47:30,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.43 | bwd_microstep: 1.72 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 1.75
[2025-11-06 18:47:30,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.24 | bwd: 2.66 | bwd_inner: 1.69 | bwd_allreduce: 0.79 | step: 1.85
73%|███████▎ | 2545/3507 [1:02:44<24:13, 1.51s/it] {'loss': 0.4031, 'learning_rate': 3.6951975452775567e-06, 'epoch': 0.73}
tensor([[-2.8438, -2.7969, 0.6172, 4.3125, -0.4941]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 18:47:30,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.28 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.7812, -1.6953, 1.2656, -1.2422, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.7500, -4.6250, -0.8008, 2.9219, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-7.9062, -3.7969, 0.1953, -4.2188, -7.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-4.1250, -0.4609, 2.4844, -1.1641, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.9375, -5.6562, -1.6250, 1.9453, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.4375, -2.8906, 1.9922, 3.2656, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-8.1875, -5.0000, 0.9570, -0.4668, -6.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 18:47:33,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.22 | optimizer_step: 0.32
[2025-11-06 18:47:33,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.52 | bwd_microstep: 2800.25 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 2799.25 | step_microstep: 2.34
[2025-11-06 18:47:33,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 491.83 | bwd: 2801.21 | bwd_inner: 1.74 | bwd_allreduce: 2799.31 | step: 2.43
73%|███████▎ | 2546/3507
[1:02:47<33:00, 2.06s/it] {'loss': 0.6015, 'learning_rate': 3.6880302248185528e-06, 'epoch': 0.73} 73%|███████▎ | 2546/3507 [1:02:47<33:00, 2.06s/it]tensor([[-3.5312, 0.1699, 3.1875, -0.7578, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.7656, 0.0339, 3.8750, 4.5938, -0.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.9375, -5.8125, 0.0786, 0.7656, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1562, -4.7812, -1.2578, 1.8125, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:47:33,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.29 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-5.9062, -3.3750, 1.4219, 0.5820, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.7500, -5.4062, -0.9062, 2.8906, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1406, -0.1074, 2.4219, -0.1279, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.6250, -6.1562, -1.2109, 2.5469, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:47:34,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 18:47:34,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.66 | bwd_microstep: 143.86 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 142.87 | step_microstep: 2.11 [2025-11-06 18:47:34,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.95 | bwd: 144.81 | bwd_inner: 1.71 | bwd_allreduce: 142.94 | step: 2.21 73%|███████▎ | 2547/3507 [1:02:47<25:31, 1.60s/it] 
{'loss': 0.2944, 'learning_rate': 3.680868290173677e-06, 'epoch': 0.73} 73%|███████▎ | 2547/3507 [1:02:47<25:31, 1.60s/it]tensor([[-4.1250, -4.3750, -1.4531, 2.3906, -1.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:47:34,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.54 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.9688, -5.6562, -2.4844, 2.4688, -1.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -5.2500, -1.8203, 2.5625, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -3.9375, 0.5781, 2.6250, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5312, -2.6562, 1.1797, 1.6484, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -2.3125, 2.2812, 1.2422, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6250, -1.8047, 2.0625, 0.5898, -3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9062, -5.6562, -1.6953, 1.8828, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:47:37,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 18:47:37,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.76 | bwd_microstep: 2573.46 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 2572.18 | step_microstep: 2.49 [2025-11-06 18:47:37,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.34 | bwd: 2574.23 | bwd_inner: 1.84 | bwd_allreduce: 2572.23 | step: 2.58 73%|███████▎ | 2548/3507 [1:02:50<32:11, 2.01s/it] {'loss': 0.3458, 
'learning_rate': 3.673711747453994e-06, 'epoch': 0.73} 73%|███████▎ | 2548/3507 [1:02:50<32:11, 2.01s/it]tensor([[-4.8125, -3.1406, 0.8086, 1.3516, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3438, -3.9375, -0.0192, 3.3438, -1.8359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -5.3438, -1.8281, 2.2812, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.9688, -4.9688, 1.0234, 1.9844, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:47:37,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.05 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12 tensor([[-3.2031, -3.0156, 0.0957, 3.3438, -0.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7188, -0.3984, 3.3281, 2.4531, -1.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5938, -4.5625, 1.0391, 1.7891, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -5.0938, -2.5156, 1.4219, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:47:37,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:47:37,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.06 | bwd_microstep: 1.81 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.74 | step_microstep: 1.28 [2025-11-06 18:47:37,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 407.12 | bwd: 2.78 | bwd_inner: 1.84 | bwd_allreduce: 0.78 | step: 1.40 73%|███████▎ | 2549/3507 [1:02:51<24:40, 1.55s/it] {'loss': 0.2846, 'learning_rate': 3.666560602765965e-06, 
'epoch': 0.73} 73%|███████▎ | 2549/3507 [1:02:51<24:40, 1.55s/it]tensor([[-5.0000, -4.9375, -1.4453, 2.0938, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4375, -4.9375, -2.4375, 1.6875, -1.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:47:37,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.14 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.1250, -5.2188, -1.6641, 2.2344, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.1719, -3.8906, -1.5547, 2.8750, -0.5586]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.1406, 0.4082, 1.0703, -3.2188, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.3750, -2.4219, 3.0312, -0.6758, -5.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3750, -3.8125, -1.0078, 3.5156, -0.6289]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4375, -3.8438, -0.7930, 3.6562, -0.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:47:40,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.18 | optimizer_step: 0.16 [2025-11-06 18:47:40,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.92 | bwd_microstep: 2488.68 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 2487.61 | step_microstep: 1.78 [2025-11-06 18:47:40,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.09 | bwd: 2489.54 | bwd_inner: 1.75 | bwd_allreduce: 2487.65 | step: 1.87 73%|███████▎ | 2550/3507 [1:02:54<31:02, 1.95s/it] {'loss': 0.1694, 'learning_rate': 3.6594148622114465e-06, 'epoch': 0.73} 
73%|███████▎ | 2550/3507 [1:02:54<31:02, 1.95s/it]tensor([[-5.4062, -3.4219, 0.9961, 1.2344, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[0.4609, 1.0859, 4.3438, 7.0000, 1.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2812, -3.5781, 0.9336, 2.1406, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5625, 0.1738, 3.0156, -0.9648, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:47:40,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.83 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.32 tensor([[-3.5000, 0.8672, 4.3125, -0.7383, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4688, -4.2812, -0.5586, 2.9062, -1.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0625, -2.7031, 0.8203, -0.1992, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3750, -2.3750, 0.6641, 0.5352, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:47:41,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:47:41,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.90 | bwd_microstep: 973.78 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 972.80 | step_microstep: 1.64 [2025-11-06 18:47:41,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.75 | bwd: 974.66 | bwd_inner: 1.67 | bwd_allreduce: 972.84 | step: 1.97 73%|███████▎ | 2551/3507 [1:02:55<28:17, 1.78s/it] {'loss': 0.8621, 'learning_rate': 3.652274531887686e-06, 'epoch': 0.73} 73%|███████▎ | 2551/3507 
[1:02:55<28:17, 1.78s/it]tensor([[-4.1250, -2.2969, 1.9766, 2.7031, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6250, -1.8359, 1.3828, -0.3867, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:47:41,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.20 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.7969, -1.4453, 1.0625, -0.3320, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.1562, -4.9375, -0.6328, 1.3594, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0938, -0.5117, 1.6172, -0.3379, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.9375, -4.0625, 2.0469, 1.0859, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5312, -2.2500, 1.6328, -0.8164, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8750, -4.2500, -1.3047, 2.9219, -1.2109]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:47:43,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.14 | optimizer_step: 0.18 [2025-11-06 18:47:43,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.34 | bwd_microstep: 870.36 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 869.39 | step_microstep: 1.66 [2025-11-06 18:47:43,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.57 | bwd: 871.19 | bwd_inner: 1.63 | bwd_allreduce: 869.43 | step: 1.75 73%|███████▎ | 2552/3507 [1:02:56<25:59, 1.63s/it] {'loss': 0.3391, 'learning_rate': 3.645139617887312e-06, 'epoch': 0.73} 73%|███████▎ | 2552/3507 [1:02:56<25:59, 
1.63s/it]tensor([[-3.4219, -3.7969, -1.1172, 3.0000, -0.8320]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:47:43,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.78 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.2500, -3.5312, 0.3477, 1.0312, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7656, -0.6133, 2.5156, -0.2598, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4375, -5.0000, -2.0312, 2.7188, -1.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0312, -3.0156, -0.9023, 2.1250, -0.9258]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0312, -0.5078, 3.0312, -0.5117, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.7188, -3.8750, 1.7031, 0.8555, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.1562, -0.0195, 1.7109, -1.5156, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:47:44,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:47:44,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 295.75 | bwd_microstep: 706.74 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 705.68 | step_microstep: 2.29 [2025-11-06 18:47:44,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 432.56 | bwd: 707.70 | bwd_inner: 1.84 | bwd_allreduce: 705.73 | step: 2.35 73%|███████▎ | 2553/3507 [1:02:58<23:48, 1.50s/it] {'loss': 0.1652, 'learning_rate': 3.6380101262983325e-06, 'epoch': 0.73} 73%|███████▎ | 2553/3507 [1:02:58<23:48, 1.50s/it]tensor([[-5.7812, 
-4.7812, -0.3730, 2.2344, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:47:44,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.42 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.5625, 1.0547, 3.3906, -1.0234, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-8.8125, -7.6562, -1.7812, 1.2656, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9688, -2.4531, 2.3906, 1.7812, -3.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6875, -5.7500, -0.4941, 2.4531, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0000, -2.6250, 1.1562, 2.2031, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2500, -2.9062, 1.1562, 0.3750, -3.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5469, -0.3750, 2.6875, 0.2002, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:47:45,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.17 | optimizer_step: 0.21 [2025-11-06 18:47:45,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.51 | bwd_microstep: 964.03 | bwd_inner_microstep: 1.28 | bwd_allreduce_microstep: 962.65 | step_microstep: 2.00 [2025-11-06 18:47:45,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.97 | bwd: 964.94 | bwd_inner: 2.11 | bwd_allreduce: 962.69 | step: 2.07 73%|███████▎ | 2554/3507 [1:02:59<23:21, 1.47s/it] {'loss': 0.4014, 'learning_rate': 3.6308860632041275e-06, 'epoch': 0.73} 73%|███████▎ | 2554/3507 [1:02:59<23:21, 1.47s/it]tensor([[-2.1406, 2.0625, 3.4219, -2.0156, 
-3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:47:45,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.53 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-0.5195, 2.3906, 2.0000, -1.2188, -1.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-1.8438, 2.2188, 3.3125, -2.0781, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.1875, -3.5469, 2.2188, -0.3926, -5.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2188, -3.5312, 0.4688, 3.4688, -1.8516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6562, 0.1211, 4.2500, -1.9141, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6250, -0.0166, 3.0469, -1.0234, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5938, -1.7891, 1.7031, -0.0287, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:47:46,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.22 | optimizer_step: 0.20 [2025-11-06 18:47:46,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.61 | bwd_microstep: 237.05 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 236.09 | step_microstep: 3.02 [2025-11-06 18:47:46,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 458.17 | bwd: 237.91 | bwd_inner: 1.61 | bwd_allreduce: 236.14 | step: 3.12 73%|███████▎ | 2555/3507 [1:03:00<19:51, 1.25s/it] {'loss': 0.1897, 'learning_rate': 3.623767434683444e-06, 'epoch': 0.73} 73%|███████▎ | 2555/3507 [1:03:00<19:51, 1.25s/it]tensor([[-5.5000, -2.6562, 1.4453, -0.0322, -4.3750]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3438, -3.5469, 0.2188, 2.7656, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:47:46,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.26 | bwd_microstep: 5.60 | bwd_inner_microstep: 5.49 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.9062, -4.3750, -2.0312, 2.0000, -1.2578]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2812, -1.2266, 3.4688, -0.6875, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0938, -4.0312, -0.5117, 2.9688, -1.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.7812, -3.2188, -1.7734, 1.6172, -0.6055]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5938, -5.7500, -1.0000, 1.8047, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0938, -4.5938, 0.9688, 3.0625, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:47:47,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:47:47,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.79 | bwd_microstep: 1232.87 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1231.71 | step_microstep: 1.85 [2025-11-06 18:47:47,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.08 | bwd: 1238.46 | bwd_inner: 6.57 | bwd_allreduce: 1231.75 | step: 1.93 73%|███████▎ | 2556/3507 [1:03:01<21:34, 1.36s/it] {'loss': 0.3581, 'learning_rate': 3.6166542468103982e-06, 'epoch': 0.73} 73%|███████▎ | 2556/3507 [1:03:01<21:34, 1.36s/it]tensor([[-5.7812, -3.6875, 0.2793, 0.0928, -4.1875]], device='cuda:2', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4062, -1.3438, 2.1406, -0.3672, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:47:48,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 121.22 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.3750, -3.2812, 1.3047, 1.5078, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0000, -3.4219, 0.6445, 1.7812, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5938, -2.2188, 2.7344, 0.2373, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3125, -5.6562, -1.9531, 2.8906, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0000, -4.3438, -0.3086, 2.6094, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2188, -1.7969, 1.5469, 0.6641, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:47:50,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:47:50,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.20 | bwd_microstep: 2143.10 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 2141.67 | step_microstep: 2.32 [2025-11-06 18:47:50,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.44 | bwd: 2143.95 | bwd_inner: 2.10 | bwd_allreduce: 2141.72 | step: 2.40 73%|███████▎ | 2557/3507 [1:03:04<26:50, 1.70s/it] {'loss': 0.3195, 'learning_rate': 3.609546505654462e-06, 'epoch': 0.73} 73%|███████▎ | 2557/3507 [1:03:04<26:50, 1.70s/it]tensor([[-6.4688, -3.8281, 1.7031, 1.0703, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') tensor([[-6.0625, -6.4688, -3.8281, 0.3555, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.4375, -4.2812, 0.9141, 1.3750, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:47:50,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.43 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.0938, -3.1875, 1.2344, -0.4180, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1094, 0.7773, 2.4062, -2.5469, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.1670, 2.9531, 2.5000, -0.6836, -0.9297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-5.6250, -2.8750, 2.9688, 2.1250, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.1562, -2.7656, 2.1406, -0.1318, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:47:51,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:47:51,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.83 | bwd_microstep: 165.78 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 164.75 | step_microstep: 1.96 [2025-11-06 18:47:51,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.29 | bwd: 166.83 | bwd_inner: 1.92 | bwd_allreduce: 164.79 | step: 2.04 73%|███████▎ | 2558/3507 [1:03:04<21:26, 1.36s/it] {'loss': 0.4585, 'learning_rate': 3.602444217280445e-06, 'epoch': 0.73} 73%|███████▎ | 2558/3507 [1:03:04<21:26, 1.36s/it]tensor([[-5.9688, -6.9062, -5.2188, -0.8203, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') 
tensor([[-3.8594, -0.2715, 2.4219, -1.3750, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -1.1406, 2.2344, 0.0791, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:47:51,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.30 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.4375, -4.9375, -0.9336, 2.3906, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4062, -2.5469, 1.6016, 2.2656, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4375, -2.6562, 0.9141, 1.1875, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9688, -5.5312, -1.4375, 1.7578, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5625, -4.1250, 1.1875, 0.8750, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:47:52,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.22 | optimizer_step: 0.30 [2025-11-06 18:47:52,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.63 | bwd_microstep: 1219.09 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1218.05 | step_microstep: 2.53 [2025-11-06 18:47:52,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.95 | bwd: 1219.94 | bwd_inner: 1.71 | bwd_allreduce: 1218.10 | step: 2.61 73%|███████▎ | 2559/3507 [1:03:06<22:29, 1.42s/it] {'loss': 0.5355, 'learning_rate': 3.595347387748529e-06, 'epoch': 0.73} 73%|███████▎ | 2559/3507 [1:03:06<22:29, 1.42s/it]tensor([[-5.8750, -5.4062, -0.6836, 2.6562, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6719, -2.6406, 
[Training log, 2025-11-06 18:47:52–18:48:22 — steps 2560–2580 of 3507 (73–74%), epoch 0.73–0.74. The raw excerpt interleaved three kinds of output per step: per-rank debug prints of logits and label tensors (torch.bfloat16, cuda:0–cuda:3), DeepSpeed [Rank 0] timing breakdowns (fwd/bwd/bwd_allreduce/step microsteps, with bwd_allreduce dominating at roughly 1 ms–3.4 s), and the tqdm progress bar. The debug prints and timing lines are omitted here; the per-step loss and learning rate are retained below.]

step  loss    learning_rate
2560  0.4943  3.5883e-06
2561  0.3331  3.5812e-06
2562  0.4038  3.5741e-06
2563  0.1110  3.5670e-06
2564  0.4786  3.5599e-06
2565  0.2110  3.5529e-06
2566  0.9218  3.5458e-06
2567  0.2535  3.5388e-06
2568  0.9102  3.5317e-06
2569  0.9315  3.5247e-06
2570  0.1450  3.5176e-06
2571  0.4009  3.5106e-06
2572  0.1709  3.5036e-06
2573  0.2438  3.4966e-06
2574  0.3690  3.4896e-06
2575  0.5492  3.4825e-06
2576  0.3708  3.4755e-06
2577  0.4120  3.4685e-06
2578  0.1957  3.4616e-06
2579  0.5748  3.4546e-06
2580  0.2465  3.4476e-06
tensor([[-4.7500, -4.4688, -0.9844, 2.3438, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5938, -3.9219, -2.0312, 1.1797, -1.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.5312, -4.9688, 1.1797, 0.9570, -5.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.0312, 1.1484, 3.3750, 0.4844, -2.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6406, -2.5000, 1.1328, 2.6562, -1.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:48:23,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.22 [2025-11-06 18:48:23,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.81 | bwd_microstep: 119.09 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 117.97 | step_microstep: 2.09 [2025-11-06 18:48:23,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 288.65 | bwd: 120.05 | bwd_inner: 1.89 | bwd_allreduce: 118.00 | step: 2.17 74%|███████▎ | 2581/3507 [1:03:37<17:08, 1.11s/it] {'loss': 0.8258, 'learning_rate': 3.440610431683479e-06, 'epoch': 0.74} 74%|███████▎ | 2581/3507 [1:03:37<17:08, 1.11s/it]tensor([[-6.4062, -3.2344, 1.8516, -0.1387, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.0000, -5.8438, -0.0669, 2.6719, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:48:23,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.19 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-2.6719, 1.6094, 2.7188, -3.1875, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.1562, -4.9375, 
0.0320, 2.3281, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9219, -0.0049, 3.3125, -0.9023, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-2.6562, 0.8203, 2.8750, -0.8945, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-9.3125, -7.0000, -0.2871, 0.4531, -6.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6875, -3.0625, 0.9180, 1.6641, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:48:27,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.17 | optimizer_step: 0.23 [2025-11-06 18:48:27,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.80 | bwd_microstep: 2595.52 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 2594.40 | step_microstep: 2.36 [2025-11-06 18:48:27,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 499.02 | bwd: 2596.41 | bwd_inner: 1.78 | bwd_allreduce: 2594.46 | step: 2.47 74%|███████▎ | 2582/3507 [1:03:41<30:45, 1.99s/it] {'loss': 0.7513, 'learning_rate': 3.4336408173482485e-06, 'epoch': 0.74} 74%|███████▎ | 2582/3507 [1:03:41<30:45, 1.99s/it]tensor([[-6.5625, -4.2812, 1.0312, 0.9727, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3125, -4.0312, -0.4746, 2.9219, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:48:27,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.19 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9688, -0.4297, 3.4688, -2.1875, -5.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0938, -3.0781, 1.2266, 1.2109, 
-3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.6250, -4.2188, 1.7109, -0.3066, -6.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0312, -4.3750, -2.1250, 1.9141, -1.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0312, -3.6406, 0.5039, 1.7812, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.4688, -5.8125, -0.8984, 2.4688, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:48:27,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:48:27,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.37 | bwd_microstep: 229.35 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 228.19 | step_microstep: 1.67 [2025-11-06 18:48:27,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.57 | bwd: 230.38 | bwd_inner: 2.02 | bwd_allreduce: 228.24 | step: 1.76 74%|███████▎ | 2583/3507 [1:03:41<24:25, 1.59s/it] {'loss': 0.2464, 'learning_rate': 3.426676805889979e-06, 'epoch': 0.74} 74%|███████▎ | 2583/3507 [1:03:41<24:25, 1.59s/it]tensor([[-1.8828, 1.9453, 2.9062, -2.0938, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7500, -3.7500, 0.6211, 3.0469, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:48:28,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.64 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-7.0000, -5.5625, -0.8164, 0.7070, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1562, -1.6875, 2.7500, -2.2500, -6.0000]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9375, -2.5625, 1.7344, 0.7930, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1562, -4.1250, -2.1719, 2.6406, -0.4297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5625, -2.7188, 1.8516, -1.7500, -5.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.7812, -4.4688, -0.3516, 1.0859, -3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:48:28,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:48:28,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.57 | bwd_microstep: 6.32 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 5.27 | step_microstep: 2.08 [2025-11-06 18:48:28,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 411.24 | bwd: 7.09 | bwd_inner: 1.67 | bwd_allreduce: 5.30 | step: 2.15 74%|███████▎ | 2584/3507 [1:03:42<19:10, 1.25s/it] {'loss': 0.4672, 'learning_rate': 3.4197184032508636e-06, 'epoch': 0.74} 74%|███████▎ | 2584/3507 [1:03:42<19:10, 1.25s/it]tensor([[-4.3438, -1.8359, 1.7422, 0.4648, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:48:28,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.13 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0000, -2.8438, 1.9453, 2.2031, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.2148, 0.8164, 2.1719, 2.4219, 0.2129]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7812, -2.5000, 1.3594, 0.9844, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:0') tensor([[-5.5312, -5.0938, -1.2031, 2.0312, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9375, 0.2119, 3.7344, -1.0625, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1875, -3.7812, 0.2910, 1.7578, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1875, -4.3438, -0.4043, 1.7500, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:48:29,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.56 | optimizer_step: 0.46 [2025-11-06 18:48:29,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.37 | bwd_microstep: 775.27 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 774.15 | step_microstep: 4.69 [2025-11-06 18:48:29,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 394.53 | bwd: 776.32 | bwd_inner: 1.88 | bwd_allreduce: 774.24 | step: 4.79 74%|███████▎ | 2585/3507 [1:03:43<19:02, 1.24s/it] {'loss': 0.7142, 'learning_rate': 3.4127656153682866e-06, 'epoch': 0.74} 74%|███████▎ | 2585/3507 [1:03:43<19:02, 1.24s/it]tensor([[-3.9375, -1.2266, 2.2656, 0.9141, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:48:29,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.22 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-7.7500, -6.1250, -0.6445, 0.7656, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.1250, -6.4375, -0.2227, 1.3359, -5.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9219, 0.8477, 3.8438, -2.3906, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') 
tensor([[-1.6016, 2.5312, 3.0625, -2.5000, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.3750, 0.3301, 2.9531, -1.1172, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6875, -2.7656, 2.5469, 3.3125, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.4688, -5.0625, 0.2539, 2.2344, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:48:32,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:48:32,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.52 | bwd_microstep: 986.58 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 985.47 | step_microstep: 1.88 [2025-11-06 18:48:32,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.77 | bwd: 987.54 | bwd_inner: 1.86 | bwd_allreduce: 985.52 | step: 1.96 74%|███████▎ | 2586/3507 [1:03:45<24:47, 1.62s/it] {'loss': 0.5815, 'learning_rate': 3.405818448174857e-06, 'epoch': 0.74} 74%|███████▎ | 2586/3507 [1:03:45<24:47, 1.62s/it]tensor([[-5.1250, -3.1250, 1.1094, 1.4922, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2500, -4.2812, -1.0703, 2.8438, -1.6172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0000, -3.7031, -1.9062, 2.3125, -0.4805]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:48:32,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.23 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.6094, -0.6016, 2.8594, 0.6953, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -2.2656, 
3.0938, 2.5312, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -4.3438, -0.5742, 1.9844, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-8.3750, -5.7500, 0.3281, 0.0476, -6.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2500, -3.0000, 0.9727, 0.5625, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:48:32,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:48:32,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.40 | bwd_microstep: 186.15 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 184.98 | step_microstep: 2.09 [2025-11-06 18:48:32,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.65 | bwd: 187.22 | bwd_inner: 2.07 | bwd_allreduce: 185.02 | step: 2.17 74%|███████▍ | 2587/3507 [1:03:46<19:49, 1.29s/it] {'loss': 0.3723, 'learning_rate': 3.3988769075983796e-06, 'epoch': 0.74} 74%|███████▍ | 2587/3507 [1:03:46<19:49, 1.29s/it]tensor([[-6.3125, -5.3750, -0.9688, 1.3125, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:48:32,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 94.57 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.0625, -4.9688, -0.0258, 2.3750, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -5.9688, -2.8438, 2.6406, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3438, -4.1250, -0.9219, 2.5156, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4062, -3.8125, 0.1357, 1.2500, -3.4062]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7500, -3.2812, 1.3047, 0.7188, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-8.6875, -6.8125, -0.1221, 1.6172, -5.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.1875, -4.2188, 1.8984, 0.7188, -5.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:48:33,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.27 [2025-11-06 18:48:33,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.09 | bwd_microstep: 139.38 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 138.27 | step_microstep: 1.94 [2025-11-06 18:48:33,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 247.67 | bwd: 140.32 | bwd_inner: 1.89 | bwd_allreduce: 138.30 | step: 2.01 74%|███████▍ | 2588/3507 [1:03:46<16:08, 1.05s/it] {'loss': 0.3277, 'learning_rate': 3.391940999561871e-06, 'epoch': 0.74} 74%|███████▍ | 2588/3507 [1:03:46<16:08, 1.05s/it]tensor([[-0.7500, 3.2344, 3.2188, -2.0312, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2344, 0.5391, 3.3750, -0.7695, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.6562, -6.1250, -1.3828, 2.0000, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:48:33,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.77 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.8750, -4.6250, -0.8164, 2.7344, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.4688, 1.4297, 2.4844, -2.6562, -3.3750]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4375, -5.8125, -0.7461, 3.0156, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.0000, -5.2188, -0.4512, 0.7070, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3438, -2.1250, 1.3750, -1.2734, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:48:35,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.23 | optimizer_step: 0.26 [2025-11-06 18:48:35,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.79 | bwd_microstep: 1649.76 | bwd_inner_microstep: 2.32 | bwd_allreduce_microstep: 1647.33 | step_microstep: 7.32 [2025-11-06 18:48:35,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 373.52 | bwd: 1650.64 | bwd_inner: 3.10 | bwd_allreduce: 1647.38 | step: 7.41 74%|███████▍ | 2589/3507 [1:03:49<22:42, 1.48s/it] {'loss': 0.1971, 'learning_rate': 3.385010729983529e-06, 'epoch': 0.74} 74%|███████▍ | 2589/3507 [1:03:49<22:42, 1.48s/it]tensor([[-5.9375, -4.7812, -0.7227, 1.3125, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7812, -0.1055, 1.6094, -0.5703, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -1.5000, 1.0469, -2.6719, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:48:35,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.29 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.4375, -5.6562, 0.1094, 1.2656, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0625, -2.1406, 2.6719, 1.1172, -4.0000]], device='cuda:2', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0312, -3.3906, 1.8672, 1.2422, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3438, -3.9062, -0.0684, 0.9570, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.0234, 2.2188, 2.2031, -1.6250, -1.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 18:48:36,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.59 | optimizer_gradients: 0.25 | optimizer_step: 0.21 [2025-11-06 18:48:36,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 209.97 | bwd_microstep: 64.54 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 63.37 | step_microstep: 10.77 [2025-11-06 18:48:36,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 412.28 | bwd: 65.23 | bwd_inner: 1.64 | bwd_allreduce: 63.41 | step: 10.85 74%|███████▍ | 2590/3507 [1:03:49<18:20, 1.20s/it] {'loss': 0.7748, 'learning_rate': 3.378086104776743e-06, 'epoch': 0.74} 74%|███████▍ | 2590/3507 [1:03:49<18:20, 1.20s/it]tensor([[-4.3438, -3.0625, 0.9492, 2.5938, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0312, -1.1172, 2.7344, 0.8984, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4062, -2.9062, 0.5469, 1.3125, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9688, -0.6289, 3.3594, 0.5586, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:48:37,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.57 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.5000, -5.6562, -0.9766, -0.3809, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:0') tensor([[-2.7344, -3.1562, -0.9883, 2.8125, -0.4238]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8125, -5.9062, -1.7656, 2.7656, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6094, -0.6797, 2.8281, 1.1797, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:48:39,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:48:39,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.66 | bwd_microstep: 1338.92 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 1337.88 | step_microstep: 2.17 [2025-11-06 18:48:39,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.24 | bwd: 1339.60 | bwd_inner: 1.54 | bwd_allreduce: 1337.93 | step: 2.25 74%|███████▍ | 2591/3507 [1:03:52<26:27, 1.73s/it] {'loss': 0.4773, 'learning_rate': 3.371167129850089e-06, 'epoch': 0.74} 74%|███████▍ | 2591/3507 [1:03:52<26:27, 1.73s/it]tensor([[-4.4062, -2.5312, 1.2031, 1.6484, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:48:39,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.54 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-4.8750, -1.8594, 1.4219, -1.1094, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5156e+00, 9.9182e-04, 2.7500e+00, -9.5703e-01, -3.5781e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4375, -4.7500, -0.6797, 2.1562, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -2.3906, 0.5938, 0.7773, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:1') tensor([[-5.5000, -1.0078, 3.1719, -2.1250, -5.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -2.1094, 1.3516, 1.1484, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.3438, -3.1250, 2.7344, 1.0938, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:48:39,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:48:39,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.48 | bwd_microstep: 1.54 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.67 | step_microstep: 1.83 [2025-11-06 18:48:39,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 455.05 | bwd: 2.49 | bwd_inner: 1.61 | bwd_allreduce: 0.72 | step: 1.93 74%|███████▍ | 2592/3507 [1:03:53<20:47, 1.36s/it] {'loss': 0.6452, 'learning_rate': 3.3642538111073207e-06, 'epoch': 0.74} 74%|███████▍ | 2592/3507 [1:03:53<20:47, 1.36s/it]tensor([[-3.6406, -4.1562, -1.7734, 2.2344, -1.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:48:39,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.47 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.9219, -0.9062, 2.0781, -0.0732, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8750, 0.5547, 2.3281, -1.8203, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.5469, -3.4375, -0.1543, 3.4375, -1.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.1562, -5.1875, -0.9961, 1.4766, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7344, 
-1.0938, 1.1953, -0.3066, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4062, -3.1719, 1.2891, 3.3438, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.3750, -5.4062, 0.9844, 2.4219, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:48:41,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.23 | optimizer_step: 0.21 [2025-11-06 18:48:41,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.58 | bwd_microstep: 1704.69 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1703.61 | step_microstep: 2.23 [2025-11-06 18:48:41,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 288.99 | bwd: 1705.69 | bwd_inner: 1.89 | bwd_allreduce: 1703.66 | step: 2.31 74%|███████▍ | 2593/3507 [1:03:55<25:16, 1.66s/it] {'loss': 0.8553, 'learning_rate': 3.357346154447364e-06, 'epoch': 0.74} 74%|███████▍ | 2593/3507 [1:03:55<25:16, 1.66s/it]tensor([[-4.5938, -1.5234, 2.5625, 0.3652, -3.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8438, -5.0000, -0.3184, 2.7031, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -2.4531, 1.3750, 0.8828, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.2812, -3.1719, 1.0547, -1.2266, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:48:42,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.93 | bwd_microstep: 1.18 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.2188, -3.7969, 0.5117, 1.7812, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2812, -4.0625, 0.2734, 2.1250, 
-3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1875, -1.2891, 3.2969, -0.9531, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.4375, -0.0198, 3.7812, -1.6094, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:48:42,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:48:42,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.92 | bwd_microstep: 293.05 | bwd_inner_microstep: 1.63 | bwd_allreduce_microstep: 291.32 | step_microstep: 1.79
[2025-11-06 18:48:42,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 491.87 | bwd: 294.22 | bwd_inner: 2.71 | bwd_allreduce: 291.36 | step: 1.86
74%|███████▍ | 2594/3507 [1:03:56<21:27, 1.41s/it] {'loss': 0.8722, 'learning_rate': 3.350444165764315e-06, 'epoch': 0.74}
tensor([[-3.9375, -2.9062, 0.7461, 2.2344, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5312, 0.7461, 3.5156, -1.9219, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6406, -4.3438, -3.3906, 0.1689, -1.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:48:43,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.11 | bwd_microstep: 5.41 | bwd_inner_microstep: 5.26 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-6.1250, -6.0938, -2.1250, 1.6641, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.6875, -3.6250, -0.1006, 1.4922, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5312, -2.4219, 1.6250, 1.6406, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.1562, -3.9219, 1.0625, 1.1953, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.0000, -4.6562, -0.2930, 1.0703, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:48:45,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.17 | optimizer_step: 0.25
[2025-11-06 18:48:45,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.92 | bwd_microstep: 1692.26 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 1690.99 | step_microstep: 2.19
[2025-11-06 18:48:45,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 394.05 | bwd: 1697.67 | bwd_inner: 6.45 | bwd_allreduce: 1691.05 | step: 2.29
74%|███████▍ | 2595/3507 [1:03:59<26:54, 1.77s/it] {'loss': 0.6605, 'learning_rate': 3.343547850947434e-06, 'epoch': 0.74}
tensor([[-2.3594, -1.0156, 3.2344, 4.8750, -0.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:48:45,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.92 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-5.0938, -2.1875, 1.9062, 0.1562, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.7812, 0.6484, 3.1562, 2.0781, -1.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.6875, -3.5156, 2.1562, 0.0752, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4688, -3.2656, 0.6602, 2.4531, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.6562, -0.1016, 2.5312, 1.3281, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1250, -4.0000, -0.1108, 1.6719, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.0469, -1.8984, 0.2734, 0.9219, -1.8672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:48:45,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:48:45,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.11 | bwd_microstep: 195.45 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 194.20 | step_microstep: 1.52
[2025-11-06 18:48:45,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.03 | bwd: 196.30 | bwd_inner: 1.93 | bwd_allreduce: 194.24 | step: 1.61
74%|███████▍ | 2596/3507 [1:03:59<21:37, 1.42s/it] {'loss': 0.6067, 'learning_rate': 3.3366572158811384e-06, 'epoch': 0.74}
tensor([[-6.9062e+00, -5.0625e+00, -7.1484e-01, 5.8594e-03, -4.7812e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0000, -3.7344, 0.1504, 1.8359, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:48:46,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.39 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.5625, -2.3906, 1.1875, 0.8242, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.2344, -2.2031, -1.0781, 3.3438, 0.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-3.4688, 0.2412, 2.3125, -1.7969, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.5625, -3.2188, -1.8359, 2.0625, -0.2422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.1562, -2.2188, 2.5156, 1.0781, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8125, -1.2734, 3.2344, 0.1416, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:48:48,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 18:48:48,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.96 | bwd_microstep: 959.33 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 958.04 | step_microstep: 2.24
[2025-11-06 18:48:48,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.38 | bwd: 960.13 | bwd_inner: 1.89 | bwd_allreduce: 958.09 | step: 2.33
74%|███████▍ | 2597/3507 [1:04:02<25:16, 1.67s/it] {'loss': 0.5699, 'learning_rate': 3.3297722664450005e-06, 'epoch': 0.74}
tensor([[-6.0938, -4.0938, 1.2812, 2.2344, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.8438, -2.5938, 2.4844, 0.5820, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9531, -2.6406, 0.8320, 1.9062, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:48:48,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.79 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.9375, -3.7188, -0.7812, 4.2812, -0.1680]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.4844, 2.2188, 3.4062, -1.0703, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2500, -4.6875, -1.3984, 3.1719, -1.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6875, -3.8906, -0.0796, 2.3125, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.4688, -5.6562, -1.0312, 1.5547, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:48:49,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:48:49,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.93 | bwd_microstep: 933.36 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 932.24 | step_microstep: 107.58
[2025-11-06 18:48:49,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.74 | bwd: 934.46 | bwd_inner: 2.00 | bwd_allreduce: 932.29 | step: 107.67
74%|███████▍ | 2598/3507 [1:04:03<24:16, 1.60s/it] {'loss': 0.2802, 'learning_rate': 3.32289300851374e-06, 'epoch': 0.74}
tensor([[-4.1875, -3.4219, 0.2695, 2.7500, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:48:49,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.43 | bwd_microstep: 1.16 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.1562, -4.6875, -0.4043, 3.0625, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2500, -2.1562, 1.8359, 3.9219, -1.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.8750, -0.3672, 2.7031, -0.8086, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.8750, -2.8438, 2.7969, -0.9141, -6.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2812, -3.8750, -2.2500, 1.4766, -0.8945]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.1797, -2.0312, -2.3594, 0.5742, 0.5039]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-5.1875, -2.2188, 2.0312, 0.0571, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:48:51,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:48:51,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.64 | bwd_microstep: 1326.01 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1324.88 | step_microstep: 1.87
[2025-11-06 18:48:51,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.10 | bwd: 1327.18 | bwd_inner: 2.13 | bwd_allreduce: 1324.92 | step: 1.94
74%|███████▍ | 2599/3507 [1:04:05<24:51, 1.64s/it] {'loss': 0.7199, 'learning_rate': 3.3160194479572193e-06, 'epoch': 0.74}
tensor([[-4.1250, -3.9375, -0.1787, 3.4531, -1.5391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3125, -0.8477, 4.1562, -0.9531, -5.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.3750, -4.9062, 0.0569, 1.4297, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5625, -0.4102, 2.2656, -2.5312, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:48:51,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.40 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-0.2910, 2.4062, 2.4688, -0.6445, -1.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-2.7812, -2.2500, 0.9961, 3.6719, -0.8398]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.5312, -3.0469, 2.5156, 0.1562, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.6328, 1.3750, 3.4531, 0.7266, -1.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:48:52,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:48:52,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 268.84 | bwd_microstep: 219.51 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 218.31 | step_microstep: 1.93
[2025-11-06 18:48:52,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 483.27 | bwd: 220.23 | bwd_inner: 1.72 | bwd_allreduce: 218.36 | step: 2.01
74%|███████▍ | 2600/3507 [1:04:05<20:45, 1.37s/it] {'loss': 0.3595, 'learning_rate': 3.309151590640446e-06, 'epoch': 0.74}
tensor([[-7.0938, -5.1250, 0.1021, 0.7227, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:48:52,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.68 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.2793, 3.5938, 3.1719, -2.2031, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.7031, -0.4531, 1.2422, -2.0938, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0312, -3.6719, 0.6445, 2.4062, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.8438, -5.9375, -2.3594, 0.0092, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.7969, -4.4688, -1.8281, 2.5625, -1.0391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.1846, 1.2500, 0.4355, -0.2656, -0.1670]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0625, -5.4062, -1.6797, 2.9375, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:48:53,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:48:53,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.76 | bwd_microstep: 582.25 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 581.17 | step_microstep: 1.82
[2025-11-06 18:48:53,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 255.46 | bwd: 583.14 | bwd_inner: 1.78 | bwd_allreduce: 581.22 | step: 1.90
74%|███████▍ | 2601/3507 [1:04:06<18:28, 1.22s/it] {'loss': 0.6795, 'learning_rate': 3.3022894424235573e-06, 'epoch': 0.74}
tensor([[-6.0000, -5.7500, -1.5703, 2.1562, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3125, -4.8438, -0.5117, 2.7656, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.6875, -3.1406, -1.3438, 2.5469, -0.3320]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
[2025-11-06 18:48:53,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.06 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.0312, -1.4219, 2.8906, -0.3516, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.8125, -3.0312, 3.1719, 0.3223, -5.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.6836, 3.5469, 4.9375, -0.9023, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.5625, -0.6133, 4.0938, 4.1875, -1.4141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6250, -0.6328, 3.1406, -0.9180, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:48:55,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.21 | optimizer_step: 0.31
[2025-11-06 18:48:55,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 68.16 | bwd_microstep: 1966.05 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 1965.06 | step_microstep: 2.30
[2025-11-06 18:48:55,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 246.25 | bwd: 1966.93 | bwd_inner: 1.67 | bwd_allreduce: 1965.11 | step: 2.39
74%|███████▍ | 2602/3507 [1:04:09<23:05, 1.53s/it] {'loss': 0.5189, 'learning_rate': 3.2954330091618104e-06, 'epoch': 0.74}
tensor([[-1.6953, 1.6562, 4.3438, 1.1016, -1.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.7188, -4.5625, -0.2061, 1.9141, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:48:55,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.48 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-1.8203, 1.2422, 2.8281, -0.9805, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.1875, -3.2344, 2.6875, -0.7461, -6.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.7500, -4.0625, 0.7734, 3.8750, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.0625, 1.2266, 3.8594, -1.8750, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-4.0625, 0.3691, 3.2812, -2.0312, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4375, 0.3516, 3.4688, -2.8281, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:48:55,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:48:55,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.46 | bwd_microstep: 267.44 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 266.17 | step_microstep: 1.73
[2025-11-06 18:48:55,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.98 | bwd: 268.34 | bwd_inner: 2.01 | bwd_allreduce: 266.20 | step: 1.81
74%|███████▍ | 2603/3507 [1:04:09<19:04, 1.27s/it] {'loss': 0.4163, 'learning_rate': 3.2885822967055957e-06, 'epoch': 0.74}
tensor([[-4.2812, -4.3125, -1.0547, 2.7188, -1.6484]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-6.4688, -6.3750, -1.9062, 2.1875, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.6875, -4.9375, 0.9023, 2.5781, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:48:56,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.49 | bwd_microstep: 1.99 | bwd_inner_microstep: 1.77 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.15
tensor([[-2.9375, 0.1377, 2.5312, -0.1416, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.7500, -3.1875, 2.5625, 2.2031, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5625, -4.0312, -0.5820, 2.1250, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.9844, -3.9531, -2.5781, 1.7188, -0.5352]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.5938, -2.0469, 1.9219, -1.1094, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:48:58,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 18:48:58,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.49 | bwd_microstep: 1939.37 | bwd_inner_microstep: 6.25 | bwd_allreduce_microstep: 1932.95 | step_microstep: 2.33
[2025-11-06 18:48:58,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.02 | bwd: 1941.38 | bwd_inner: 8.08 | bwd_allreduce: 1933.05 | step: 2.48
74%|███████▍ | 2604/3507 [1:04:12<23:59, 1.59s/it] {'loss': 1.3277, 'learning_rate': 3.2817373109004247e-06, 'epoch': 0.74}
tensor([[-2.5938, -3.5156, -2.0781, 2.3438, -0.1211]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1250, -3.4844, 1.1484, 2.2812, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:48:58,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.74 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.5938, -3.9531, -0.1016, 2.6719, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7812, -4.6562, -0.2344, 2.0469, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.9805, 2.9062, 2.6875, -2.6094, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-2.2656, -2.6562, -1.0078, 2.5312, -0.1270]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5312, -4.2500, -0.6758, 2.6406, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.4531, 0.9961, 3.8438, -1.8594, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:48:58,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:48:58,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.78 | bwd_microstep: 114.88 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 113.79 | step_microstep: 1.91
[2025-11-06 18:48:58,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 282.54 | bwd: 115.85 | bwd_inner: 1.89 | bwd_allreduce: 113.83 | step: 1.99
74%|███████▍ | 2605/3507 [1:04:12<18:42, 1.24s/it] {'loss': 0.1723, 'learning_rate': 3.274898057586916e-06, 'epoch': 0.74}
tensor([[0.7539, 1.5000, 4.0625, 5.8750, 1.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:48:58,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.41 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.4375, -4.9688, -0.0092, 1.4062, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.3750, -5.7812, -0.1011, 2.0312, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.5625, -3.9219, -1.8984, 1.3594, -1.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.7500, -3.6250, 1.9453, 0.1709, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0312, -1.1562, 3.1875, 1.7500, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-0.9805, 1.0625, 3.1719, 1.8906, -0.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.3438, -4.1562, 0.3340, 2.6719, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:49:00,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:49:00,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.45 | bwd_microstep: 1254.19 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1253.02 | step_microstep: 5.92
[2025-11-06 18:49:00,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.90 | bwd: 1254.90 | bwd_inner: 1.66 | bwd_allreduce: 1253.06 | step: 6.00
74%|███████▍ | 2606/3507 [1:04:14<20:27, 1.36s/it] {'loss': 0.1842, 'learning_rate': 3.2680645426007984e-06, 'epoch': 0.74}
tensor([[-3.0625, -3.1406, -0.3027, 3.2812, -0.7695]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9219, -1.8125, 1.8125, 1.3516, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:49:00,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.38 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-3.2812, -2.9219, 0.5156, 3.7656, -1.0859]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.8047, 1.5156, 4.2188, 3.2500, -0.4629]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3750, -5.7188, -2.3750, 2.0625, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6562, -3.7188, 0.2559, 0.5625, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.1562, -4.3125, -0.0388, 0.7656, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.7500, -3.6094, -2.7344, 1.1094, -0.3965]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:49:01,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.19
[2025-11-06 18:49:01,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.05 | bwd_microstep: 666.22 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 665.24 | step_microstep: 3.15
[2025-11-06 18:49:01,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.45 | bwd: 667.16 | bwd_inner: 1.68 | bwd_allreduce: 665.30 | step: 3.26
74%|███████▍ | 2607/3507 [1:04:15<19:07, 1.28s/it] {'loss': 0.4348, 'learning_rate': 3.2612367717729056e-06, 'epoch': 0.74}
tensor([[-4.7188, -2.5938, 0.8008, 0.5039, -3.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8750, -4.5000, -1.0547, 2.0312, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2500, 0.4062, 3.5469, -2.3906, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.0625, -1.6016, 2.0938, -0.8633, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0312, -1.2031, 2.7969, -1.0469, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2188, -1.1641, 2.7969, -1.5859, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2656, -4.0000, -1.8594, 2.5469, -0.6328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:03,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.11 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.0625, -2.1875, 1.6016, 0.1069, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:03,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:49:03,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.26 | bwd_microstep: 1.75 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.12
[2025-11-06 18:49:03,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.38 | bwd: 2.47 | bwd_inner: 1.49 | bwd_allreduce: 0.83 | step: 7.20
74%|███████▍ | 2608/3507 [1:04:17<22:58, 1.53s/it] {'loss': 0.7194, 'learning_rate': 3.254414750929169e-06, 'epoch': 0.74}
tensor([[-2.3438, 1.4219, 3.5312, -1.3047, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:03,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 73.67 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.3750, 0.0732, 2.9844, -1.2188, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0625, -1.8984, 1.4062, 0.8359, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0625, -4.2500, -0.2637, 2.1875, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.1875, -2.6406, 2.1094, 1.1406, -3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.7188, -3.4219, 1.0312, 0.6523, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.6719, 0.4316, 3.8125, -0.8359, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-6.3438, -6.5312, -2.6250, 1.7109, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:49:05,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:49:05,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.71 | bwd_microstep: 1282.44 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1281.39 | step_microstep: 2.18
[2025-11-06 18:49:05,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.41 | bwd: 1283.13 | bwd_inner: 1.54 | bwd_allreduce: 1281.44 | step: 2.26
74%|███████▍ | 2609/3507 [1:04:18<23:16, 1.56s/it] {'loss': 1.5466, 'learning_rate': 3.247598485890614e-06, 'epoch': 0.74}
tensor([[-4.9062, -1.5938, 2.6875, -0.1201, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.2812, -4.3125, -0.7539, 3.0312, -1.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.1406, -3.4688, -0.6133, 3.5625, -0.6055]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:05,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.63 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.4375, -4.8438, -2.0156, 2.0938, -1.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.7500, -4.3750, 1.3672, 3.8906, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.1484, 2.3125, 3.9062, 0.2871, -1.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2656, -0.5586, 1.9141, -0.0452, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9375, -1.4375, 2.5312, -0.6328, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:49:06,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.23 | optimizer_step: 0.23
[2025-11-06 18:49:06,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.57 | bwd_microstep: 812.05 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 810.98 | step_microstep: 2.17
[2025-11-06 18:49:06,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.23 | bwd: 812.77 | bwd_inner: 1.57 | bwd_allreduce: 811.03 | step: 2.25
74%|███████▍ | 2610/3507 [1:04:20<21:37, 1.45s/it] {'loss': 0.6882, 'learning_rate': 3.2407879824733535e-06, 'epoch': 0.74}
tensor([[-4.5312, -2.8125, 1.1094, 1.7500, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7344, -4.4375, -1.7344, 3.0781, -0.8789]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.7812, -4.9375, -0.0613, 0.8594, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.2969, 2.8125, 3.0000, -2.6562, -2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:49:06,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.93 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11
tensor([[-5.9062, -3.5469, 1.2812, 0.8828, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.8125, -3.8281, 0.6289, 1.0312, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.3438, -4.9688, -0.1816, 1.4922, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.2500, -4.8125, 0.1553, 1.9688, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:49:10,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.19 | optimizer_step: 0.25
[2025-11-06 18:49:10,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.85 | bwd_microstep: 3818.64 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 3817.63 | step_microstep: 2.19
[2025-11-06 18:49:10,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.81 | bwd: 3819.71 | bwd_inner: 1.85 | bwd_allreduce: 3817.68 | step: 2.30
74%|███████▍ | 2611/3507 [1:04:24<34:03, 2.28s/it] {'loss': 0.5292, 'learning_rate': 3.2339832464885846e-06, 'epoch': 0.74}
tensor([[-5.6562, -4.3750, 0.1108, 1.8594, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1562, -4.5312, -2.0781, 2.1406, -1.4297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:10,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.25 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.2188, -4.3750, -0.0210, 2.4844, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.2812, -2.3594, 1.9453, 0.3301, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.5312, -5.9688, -1.0625, 2.6406, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.9688, -3.2031, -0.7383, 3.1406, -0.5664]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.2188, -3.7969, 2.4062, 0.2197, -5.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.5938, -6.2812, -1.1250, 3.2344, -3.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:49:11,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:49:11,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.13 | bwd_microstep: 47.06 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 46.22 | step_microstep: 1.85
[2025-11-06 18:49:11,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.39 | bwd: 47.93 | bwd_inner: 1.54 | bwd_allreduce: 46.25 | step: 1.93
74%|███████▍ | 2612/3507 [1:04:24<25:46, 1.73s/it] {'loss': 0.0862, 'learning_rate': 3.2271842837425917e-06, 'epoch': 0.74}
tensor([[-2.9531, -1.3594, 3.0156, 4.1562, -1.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9688, -1.8125, 0.8047, -2.3281, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1562, -5.0938, -3.7188, 0.7500, -1.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:49:11,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.67 | bwd_microstep: 0.64 | bwd_inner_microstep: 0.54 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.7812, -2.6406, 1.0703, 0.5508, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8750, -4.0312, -0.2930, 2.0156, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5312, -0.9492, 3.3750, 0.1738, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.5000, -4.2188, -0.4316, 2.9375, -1.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.5625, -2.6094, 2.9531, -0.2695, -5.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:49:12,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:49:12,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.55 | bwd_microstep: 674.94 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 674.09 | step_microstep: 2.81
[2025-11-06 18:49:12,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.25 | bwd: 675.58 | bwd_inner: 1.31 | bwd_allreduce: 674.13 | step: 2.89
75%|███████▍ | 2613/3507 [1:04:25<22:44, 1.53s/it] {'loss': 0.5655, 'learning_rate': 3.220391100036716e-06, 'epoch': 0.75}
tensor([[-5.1562, -0.7695, 3.1094, -2.2812, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1875, -3.0156, 1.0781, 3.1406, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:12,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.60 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-6.5625, -4.1250, 1.2812, 1.2891, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7812, -2.7188, 2.0000, 2.4375, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.3750, -5.9375, -1.6172, 1.7734, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1250, 0.7148, 3.7812, -2.5312, -4.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0938, -1.1250, 3.1406, -1.1016, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.0938, -3.0469, 1.0859, 1.1719, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:49:13,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:49:13,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.02 | bwd_microstep: 806.30 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 805.31 | step_microstep: 1.71
[2025-11-06 18:49:13,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 506.65 | bwd: 807.10 | bwd_inner: 1.63 | bwd_allreduce: 805.35 | step: 1.78
75%|███████▍ | 2614/3507 [1:04:27<21:59, 1.48s/it] {'loss': 0.3313, 'learning_rate': 3.2136037011673803e-06, 'epoch': 0.75}
tensor([[-5.3438, -1.7266, 2.7656, -0.1289, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.5000, -4.5312, -0.5273, 1.7969, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.7969, 0.6055, 2.2031, 0.0649, -1.8047]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.1562, -5.3750, 0.4883, 1.9688, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:13,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.50 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.3750, -1.5156, 1.9688, 0.4375, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.3125, -1.8359, 2.2344, 1.2188, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3438, -0.9609, 2.7344, -0.3496, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7969, -2.8281, -0.6367, 0.3633, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:49:14,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.99 |
optimizer_gradients: 0.22 | optimizer_step: 0.20 [2025-11-06 18:49:14,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.96 | bwd_microstep: 956.74 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 955.54 | step_microstep: 3.28 [2025-11-06 18:49:14,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.49 | bwd: 957.50 | bwd_inner: 1.78 | bwd_allreduce: 955.59 | step: 3.35 75%|███████▍ | 2615/3507 [1:04:28<21:33, 1.45s/it] {'loss': 0.3397, 'learning_rate': 3.206822092926065e-06, 'epoch': 0.75} 75%|███████▍ | 2615/3507 [1:04:28<21:33, 1.45s/it]tensor([[-5.4688, -3.6875, 1.7344, 3.2188, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6562, -3.8438, 0.8867, 1.9297, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4375, -0.7969, 1.9453, -0.0068, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:49:15,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.53 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.8125, -4.0000, 1.5703, 0.9062, -5.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7656, -3.7500, -0.3711, 3.5781, -1.2266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5938, -3.6406, -0.4785, 3.1094, -1.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.8438, -3.5156, -1.8984, 1.8516, -0.5859]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6875, -1.3359, 2.1562, -0.4570, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:49:16,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.71 | optimizer_gradients: 0.17 | 
optimizer_step: 0.19 [2025-11-06 18:49:16,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.95 | bwd_microstep: 1457.19 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1456.07 | step_microstep: 2.51 [2025-11-06 18:49:16,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.49 | bwd: 1458.14 | bwd_inner: 1.85 | bwd_allreduce: 1456.12 | step: 2.58 75%|███████▍ | 2616/3507 [1:04:30<23:25, 1.58s/it] {'loss': 0.1773, 'learning_rate': 3.2000462810993205e-06, 'epoch': 0.75} 75%|███████▍ | 2616/3507 [1:04:30<23:25, 1.58s/it]tensor([[-3.1406, -0.2217, 1.9922, -0.7656, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:49:16,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.35 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.3438, -3.0156, 1.1016, 0.5703, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1562, -3.6250, 0.2754, 3.2812, -1.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, -4.5312, -1.2188, 2.8750, -1.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5938, 0.7188, 4.0938, -0.8164, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5625, -3.3750, 0.8164, 0.5898, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3438, -6.0312, -2.8438, 2.0469, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0312, 0.0430, 3.2188, -1.3984, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:49:17,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.85 | optimizer_gradients: 0.16 | optimizer_step: 0.19 
[2025-11-06 18:49:17,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.01 | bwd_microstep: 109.56 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 108.40 | step_microstep: 2.51 [2025-11-06 18:49:17,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.37 | bwd: 110.58 | bwd_inner: 2.01 | bwd_allreduce: 108.44 | step: 2.59 75%|███████▍ | 2617/3507 [1:04:31<18:33, 1.25s/it] {'loss': 0.2476, 'learning_rate': 3.1932762714687417e-06, 'epoch': 0.75} 75%|███████▍ | 2617/3507 [1:04:31<18:33, 1.25s/it]tensor([[-7.0312, -5.1875, 1.0469, 2.8438, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0000, -2.4219, 0.6992, -0.4551, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0000, -4.0312, 0.1367, 2.5156, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:49:17,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.44 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.8125, -5.3125, -1.1484, 1.9297, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.2188, -2.0000, -1.0078, 2.8594, 0.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5312, -1.3984, 2.1094, -0.1553, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4219, -0.6172, 2.5312, 0.0289, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2344, -3.5781, -1.3281, 2.5156, -0.8672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:49:19,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 18:49:19,521] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.66 | bwd_microstep: 1965.70 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1964.65 | step_microstep: 2.04 [2025-11-06 18:49:19,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.13 | bwd: 1966.54 | bwd_inner: 1.71 | bwd_allreduce: 1964.69 | step: 2.13 75%|███████▍ | 2618/3507 [1:04:33<23:20, 1.58s/it] {'loss': 0.1364, 'learning_rate': 3.1865120698109675e-06, 'epoch': 0.75} 75%|███████▍ | 2618/3507 [1:04:33<23:20, 1.58s/it]tensor([[-2.0781, 0.3945, 1.8828, -0.8320, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0312, -4.5000, -0.3652, 2.7031, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0312, -3.3281, 0.0374, 2.1250, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6250, -3.5625, 1.0625, 1.4688, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:49:19,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.39 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.9062, -4.2500, 0.0669, 1.2422, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, 0.5078, 2.5469, -3.6094, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4375, -2.2344, 2.3125, 0.1426, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2656, 1.6172, 4.8438, -1.7266, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:49:19,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:49:19,999] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.51 | bwd_microstep: 108.03 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 106.99 | step_microstep: 1.85 [2025-11-06 18:49:19,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.93 | bwd: 108.93 | bwd_inner: 1.77 | bwd_allreduce: 107.04 | step: 1.95 75%|███████▍ | 2619/3507 [1:04:33<18:26, 1.25s/it] {'loss': 0.3412, 'learning_rate': 3.1797536818976894e-06, 'epoch': 0.75} 75%|███████▍ | 2619/3507 [1:04:33<18:26, 1.25s/it]tensor([[-3.2500, 0.4668, 2.6250, -1.6641, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5938, -4.0938, 1.5000, 1.3594, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0000, -4.4375, 0.2334, 1.5781, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7500, -1.8203, 3.2031, 1.9219, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5938, -3.1875, 1.7031, 3.5156, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:49:20,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.93 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.6562, -4.7812, -0.4668, 2.2344, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0625, -5.1875, -1.3359, 3.0781, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6250, -4.0938, -1.4453, 2.7500, -0.9961]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:49:21,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:49:21,909] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 202.43 | bwd_microstep: 1085.55 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1084.42 | step_microstep: 1.88 [2025-11-06 18:49:21,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.40 | bwd: 1086.33 | bwd_inner: 1.71 | bwd_allreduce: 1084.47 | step: 1.97 75%|███████▍ | 2620/3507 [1:04:35<21:21, 1.45s/it] {'loss': 0.4178, 'learning_rate': 3.173001113495643e-06, 'epoch': 0.75} 75%|███████▍ | 2620/3507 [1:04:35<21:21, 1.45s/it]tensor([[-4.5312, -1.8906, 2.4688, 1.7500, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3750, -0.9414, 2.1562, -1.4766, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9375, -2.2500, 2.5625, -0.7930, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3438, -3.1562, 0.9258, 1.0938, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:49:22,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.71 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.6094, -3.4844, -0.4141, 2.8281, -1.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6562, 0.5039, 2.3281, -1.4531, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-1.3750, 1.7734, 4.3125, 1.3828, -1.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5781, -1.0156, 1.8828, 0.4844, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:49:22,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:49:22,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 181.59 | bwd_microstep: 2.10 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.88 | step_microstep: 1.99 [2025-11-06 18:49:22,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.32 | bwd: 3.20 | bwd_inner: 2.14 | bwd_allreduce: 0.92 | step: 2.10 75%|███████▍ | 2621/3507 [1:04:36<17:38, 1.19s/it] {'loss': 0.8176, 'learning_rate': 3.1662543703665873e-06, 'epoch': 0.75} 75%|███████▍ | 2621/3507 [1:04:36<17:38, 1.19s/it]tensor([[-4.8125, -1.5312, 2.5938, 0.2695, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8594, -3.7812, -0.8438, 2.5000, -1.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -3.1719, 1.0781, 1.2188, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.1562, -2.8594, -2.0000, 1.5391, -0.0271]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0625, -5.4688, -2.4375, 1.7266, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:49:23,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.09 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.9688, -5.5625, -0.9844, 2.8438, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3750, -3.8750, 0.6875, 2.2500, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.4062, 0.0830, 3.3438, 0.0776, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:49:23,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:49:23,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.50 | bwd_microstep: 
528.13 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 526.86 | step_microstep: 1.89 [2025-11-06 18:49:23,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.61 | bwd: 529.13 | bwd_inner: 2.09 | bwd_allreduce: 526.90 | step: 1.98 75%|███████▍ | 2622/3507 [1:04:37<18:35, 1.26s/it] {'loss': 0.1725, 'learning_rate': 3.159513458267317e-06, 'epoch': 0.75} 75%|███████▍ | 2622/3507 [1:04:37<18:35, 1.26s/it]tensor([[-7.8750, -7.4688, -3.2031, 0.1445, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4688, -3.7031, 0.7812, 1.6250, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0312, -1.3516, 2.7344, -0.6602, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:49:24,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 283.71 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.6875, -5.8438, -1.2812, 1.6016, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.6250, -2.6094, -0.0601, -2.4531, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9531, -0.2910, 3.2344, -0.4609, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.5781, -2.2500, -1.6875, 1.3906, 0.2471]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-3.8281, -0.8008, 2.6562, 0.1885, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:49:25,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.22 | optimizer_step: 0.21 [2025-11-06 18:49:25,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 81.95 | bwd_microstep: 738.58 | bwd_inner_microstep: 
1.15 | bwd_allreduce_microstep: 737.32 | step_microstep: 2.72 [2025-11-06 18:49:25,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.69 | bwd: 739.29 | bwd_inner: 1.75 | bwd_allreduce: 737.36 | step: 2.80 75%|███████▍ | 2623/3507 [1:04:38<18:02, 1.22s/it] {'loss': 0.4008, 'learning_rate': 3.1527783829496483e-06, 'epoch': 0.75} 75%|███████▍ | 2623/3507 [1:04:38<18:02, 1.22s/it]tensor([[-5.8125, -4.9688, 0.0109, 3.0938, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:49:25,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.62 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.1875, 1.1484, 3.4844, -1.9922, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4688, -5.1562, -2.8906, 1.7500, -1.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6562, -5.1562, 0.3438, 2.3438, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.4531, 2.3281, 2.1719, -2.9531, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.8125, -4.9688, 0.8320, 2.0156, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.8125, -4.2500, 1.4375, 1.4219, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.3125, -3.9531, 1.6328, -0.7578, -6.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:49:26,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.82 | optimizer_gradients: 0.20 | optimizer_step: 0.30 [2025-11-06 18:49:26,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.48 | bwd_microstep: 1344.95 | bwd_inner_microstep: 0.73 | 
bwd_allreduce_microstep: 1344.13 | step_microstep: 2.74 [2025-11-06 18:49:26,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 270.12 | bwd: 1345.68 | bwd_inner: 1.36 | bwd_allreduce: 1344.18 | step: 2.82 75%|███████▍ | 2624/3507 [1:04:40<19:53, 1.35s/it] {'loss': 0.2511, 'learning_rate': 3.1460491501604207e-06, 'epoch': 0.75} 75%|███████▍ | 2624/3507 [1:04:40<19:53, 1.35s/it]tensor([[-5.5000, -4.0938, 0.2188, 1.9297, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0938, -3.7188, 0.4023, 1.9141, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:49:26,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.63 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-1.5078, 1.9766, 2.5000, -2.0312, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.1562, 1.8828, 2.4844, -2.9219, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[ 0.1992, 3.7812, 3.9062, -0.7617, -1.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.8125, -5.9062, -2.9531, 0.4844, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4688, -3.6250, 1.3750, 0.2656, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5000, -4.7812, -1.2422, 3.2812, -1.5234]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:49:28,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 18:49:28,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 67.33 | bwd_microstep: 1410.08 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 
1409.23 | step_microstep: 2.03 [2025-11-06 18:49:28,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 268.98 | bwd: 1410.80 | bwd_inner: 1.39 | bwd_allreduce: 1409.29 | step: 2.12 75%|███████▍ | 2625/3507 [1:04:42<21:28, 1.46s/it] {'loss': 0.5218, 'learning_rate': 3.1393257656414842e-06, 'epoch': 0.75} 75%|███████▍ | 2625/3507 [1:04:42<21:28, 1.46s/it]tensor([[-4.7500, -0.6602, 3.3594, -0.8672, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8906, -1.1641, 2.2656, 0.8945, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.7812, -6.0000, -0.3105, 1.2266, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:49:28,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.32 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.7500, -2.3438, 0.9023, -2.1719, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.5156, -1.5078, 1.0469, 0.1396, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0938, -3.0781, 0.2334, 1.5781, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1875, -4.5625, 0.1523, 1.5234, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6875, -3.6094, 0.9102, 1.1953, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:49:29,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:49:29,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.13 | bwd_microstep: 834.62 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 833.61 | step_microstep: 
1.87 [2025-11-06 18:49:29,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.49 | bwd: 835.53 | bwd_inner: 1.72 | bwd_allreduce: 833.65 | step: 1.95 75%|███████▍ | 2626/3507 [1:04:43<20:34, 1.40s/it] {'loss': 0.4271, 'learning_rate': 3.1326082351297025e-06, 'epoch': 0.75} 75%|███████▍ | 2626/3507 [1:04:43<20:34, 1.40s/it]tensor([[-4.8125, -3.3594, 0.5469, 2.0312, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9375, -4.1875, -0.8633, 3.5000, -1.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:49:29,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.52 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.4062, -1.0078, 3.6406, -1.2969, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6250, -4.2812, -0.8008, 2.3906, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -0.7383, 3.5625, -1.1953, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.3438, -5.6875, -1.1641, 2.0312, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8516, 0.5078, 3.9844, 3.5156, -1.1016]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1875, -1.6250, 1.8672, 0.8906, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:49:31,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:49:31,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.50 | bwd_microstep: 1271.36 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1270.31 | step_microstep: 2.12 [2025-11-06 
18:49:31,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.03 | bwd: 1272.16 | bwd_inner: 1.69 | bwd_allreduce: 1270.35 | step: 2.18 75%|███████▍ | 2627/3507 [1:04:45<21:39, 1.48s/it] {'loss': 0.1541, 'learning_rate': 3.1258965643569382e-06, 'epoch': 0.75} 75%|███████▍ | 2627/3507 [1:04:45<21:39, 1.48s/it]tensor([[-5.4688, -1.5234, 3.2969, -0.6133, -5.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7500, -2.9375, 1.3828, 2.2500, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7344, -0.8633, 2.6406, 0.9375, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:49:31,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.19 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.4375, -2.6094, 1.8438, 0.2715, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9688, -2.3750, 2.4062, -0.8320, -5.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.6250, -6.6250, -0.7773, 2.3594, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.5312, -2.8750, 2.8750, 0.0513, -5.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2812, 0.6289, 2.7031, -2.0938, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:49:32,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:49:32,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.87 | bwd_microstep: 280.35 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 279.07 | step_microstep: 2.04 [2025-11-06 18:49:32,037] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.08 | bwd: 281.26 | bwd_inner: 1.96 | bwd_allreduce: 279.13 | step: 2.14
75%|███████▍ | 2628/3507 [1:04:45<18:09, 1.24s/it]
{'loss': 0.5952, 'learning_rate': 3.119190759050069e-06, 'epoch': 0.75}
tensor([[-5.8438, -5.8438, -2.4844, 1.3672, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:32,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.89 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.9062, -5.5625, 0.0352, 2.3750, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1562, -3.4531, 1.0625, 1.9219, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[ 0.8008, 4.3750, 4.5000, -0.5195, -0.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-2.6562, -0.4199, 3.1094, 2.3438, -1.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-11.3125, -7.6562, -1.6250, -3.7812, -9.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1250, -3.3750, 0.1099, 2.1562, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7188, -4.7500, -0.2441, 2.5312, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:49:34,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.21 | optimizer_step: 0.34
[2025-11-06 18:49:34,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.08 | bwd_microstep: 2392.10 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 2391.14 | step_microstep: 2.77
[2025-11-06 18:49:34,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.99 | bwd: 2392.85 | bwd_inner: 1.48 | bwd_allreduce: 2391.21 | step: 2.86
75%|███████▍ | 2629/3507 [1:04:48<24:50, 1.70s/it]
{'loss': 0.3563, 'learning_rate': 3.112490824930946e-06, 'epoch': 0.75}
tensor([[-3.3594, -4.0312, -1.6484, 2.8750, -0.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0312, -2.6719, 1.2969, 0.6914, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3125, -1.9922, 2.6562, 0.1416, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:49:35,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.51 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-6.3125, -4.1250, 1.2266, 1.8281, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.6406, -0.2227, 0.8477, -0.8047, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.0938, -3.9062, -0.1729, 1.4141, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-9.0625, -5.5312, 1.1641, -0.5195, -7.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9375, -5.4688, -2.4688, 2.2031, -1.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:49:35,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:49:35,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.27 | bwd_microstep: 68.67 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 67.45 | step_microstep: 2.22
[2025-11-06 18:49:35,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.81 | bwd: 69.67 | bwd_inner: 2.03 | bwd_allreduce: 67.49 | step: 2.30
75%|███████▍ | 2630/3507 [1:04:49<19:36, 1.34s/it]
{'loss': 0.521, 'learning_rate': 3.1057967677164258e-06, 'epoch': 0.75}
tensor([[-1.7578, 1.8516, 2.1875, -2.5156, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:49:35,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.26 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.0625, -1.7578, 1.8906, 1.5078, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.1406, -1.9062, 0.5000, 1.5547, -1.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6562, -4.1250, 1.2344, 3.0781, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.3125, -4.3125, -0.8555, 2.9531, -1.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6562, 1.0156, 3.8438, -2.6719, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2656, -2.0312, 2.0938, 3.9219, -1.4609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.5312, -5.5000, 0.3145, 1.3672, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:49:36,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.58 | optimizer_gradients: 0.30 | optimizer_step: 0.31
[2025-11-06 18:49:36,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.15 | bwd_microstep: 747.89 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 746.66 | step_microstep: 4.04
[2025-11-06 18:49:36,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.44 | bwd: 748.89 | bwd_inner: 2.07 | bwd_allreduce: 746.70 | step: 4.12
75%|███████▌ | 2631/3507 [1:04:50<18:41, 1.28s/it]
{'loss': 0.6732, 'learning_rate': 3.0991085931183418e-06, 'epoch': 0.75}
tensor([[-5.5938, -4.0625, 1.2031, 2.9219, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:49:36,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.80 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.10
tensor([[-3.6719, 0.2480, 3.8438, -0.6914, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6875, -2.8281, 1.0234, 1.1250, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.3750, -3.7656, 0.0503, 0.9258, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.6250, 2.5781, 3.5312, -2.2188, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6250, -3.6719, 0.1099, 2.1719, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5000, -0.0055, 3.7344, -1.7266, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.0938, -3.3281, 0.8359, 3.6094, -1.8047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:38,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:49:38,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.95 | bwd_microstep: 1.73 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.62 | step_microstep: 2.11
[2025-11-06 18:49:38,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 288.75 | bwd: 2.64 | bwd_inner: 1.86 | bwd_allreduce: 0.65 | step: 2.21
75%|███████▌ | 2632/3507 [1:04:52<20:55, 1.44s/it]
{'loss': 0.451, 'learning_rate': 3.0924263068435213e-06, 'epoch': 0.75}
tensor([[-3.2656, -3.2969, -0.8945, 2.4062, -1.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2500, -1.1641, 2.2344, 1.6562, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5312, -4.5312, 0.2002, 2.7656, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.0781, 2.2812, 3.0156, -1.0078, -1.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8438, -4.7812, -1.2734, 2.4688, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.7188, -1.6172, 2.7969, -1.5078, -5.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:49:38,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.19 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.8750, -2.7500, 1.0156, 0.6250, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4062, -4.0000, -0.7578, 2.0312, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:38,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 18:49:38,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.11 | bwd_microstep: 1.81 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.74 | step_microstep: 1.57
[2025-11-06 18:49:38,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 488.32 | bwd: 2.68 | bwd_inner: 1.80 | bwd_allreduce: 0.77 | step: 1.65
75%|███████▌ | 2633/3507 [1:04:52<16:57, 1.16s/it]
{'loss': 0.3367, 'learning_rate': 3.085749914593752e-06, 'epoch': 0.75}
[h264 @ 0xc411e40] mmco: unref short failure
[h264 @ 0xc411e40] mmco: unref short failure
tensor([[-4.5312, 0.3438, 3.9688, -2.3594, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.6719, -3.6250, -2.4531, 1.7891, -0.2324]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5312, -3.3281, 0.3477, 1.9141, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:38,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.10 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.0312, -4.9375, -0.9883, 3.1562, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.9688, -4.0625, 1.9297, 0.9883, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5312, -0.4512, 2.3281, -2.2031, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7188, -1.6484, 1.4766, 0.8789, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.1562, -2.6875, 1.7734, 1.1328, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:49:40,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:49:40,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 200.36 | bwd_microstep: 4.54 | bwd_inner_microstep: 3.55 | bwd_allreduce_microstep: 0.90 | step_microstep: 3.12
[2025-11-06 18:49:40,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 397.48 | bwd: 5.42 | bwd_inner: 4.35 | bwd_allreduce: 0.93 | step: 3.19
75%|███████▌ | 2634/3507 [1:04:54<20:26, 1.41s/it]
{'loss': 0.2151, 'learning_rate': 3.0790794220658047e-06, 'epoch': 0.75}
tensor([[-7.5312, -4.3438, 1.8906, 0.5312, -5.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0938, -3.8125, 0.7344, 2.5469, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5312, -3.4844, 0.4707, 0.4121, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2188, -3.3281, -1.1562, 2.2031, -0.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:40,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.76 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.4844, -2.6094, 1.5703, 4.0625, -1.3984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.3906, 1.0625, 1.5625, -2.5938, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.0781, -3.6250, -2.6875, 0.6367, -0.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.9062, -4.1875, -2.2969, 1.2188, -1.5078]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:49:41,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:49:41,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.07 | bwd_microstep: 39.26 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 38.41 | step_microstep: 2.05
[2025-11-06 18:49:41,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.84 | bwd: 39.94 | bwd_inner: 1.35 | bwd_allreduce: 38.45 | step: 2.13
75%|███████▌ | 2635/3507 [1:04:54<16:08, 1.11s/it]
{'loss': 0.2614, 'learning_rate': 3.0724148349513995e-06, 'epoch': 0.75}
tensor([[-2.7188, 1.2734, 3.7031, -1.4453, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.7188, -1.6328, 3.0156, 1.2188, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7656, -0.7539, 1.9219, -0.4980, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:49:41,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.47 | bwd_microstep: 5.46 | bwd_inner_microstep: 5.33 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-0.7500, 2.0938, 1.7422, -1.4375, -1.4609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:0')
tensor([[-4.5000, -1.3750, 2.8906, 0.5469, -3.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.5312, -6.7188, -1.4062, 1.9141, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.9688, -1.7578, 1.1562, 0.6719, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4062, -3.7656, 0.0884, 2.8750, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:49:44,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:49:44,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.59 | bwd_microstep: 2790.66 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 2789.59 | step_microstep: 1.79
[2025-11-06 18:49:44,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.09 | bwd: 2796.12 | bwd_inner: 6.31 | bwd_allreduce: 2789.65 | step: 1.88
75%|███████▌ | 2636/3507 [1:04:58<25:02, 1.73s/it]
{'loss': 0.6332, 'learning_rate': 3.0657561589372377e-06, 'epoch': 0.75}
tensor([[-0.4082, 2.6406, 2.9062, -0.9844, -1.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:49:44,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 73.84 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.8125, -2.0312, 2.0781, 0.4922, -3.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6250, -0.3613, 2.5781, -0.3965, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.7500, -2.5000, 2.2969, 0.0991, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1562, -0.5859, 3.2031, -2.6094, -5.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5938, -3.7344, 2.0156, 1.0156, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.9375, -4.5625, -0.4570, 2.9375, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.4062, -1.8984, 3.2188, -1.7969, -6.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:49:44,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.16 | optimizer_step: 0.20
[2025-11-06 18:49:44,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.42 | bwd_microstep: 328.16 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 326.89 | step_microstep: 1.59
[2025-11-06 18:49:44,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 275.27 | bwd: 329.10 | bwd_inner: 2.04 | bwd_allreduce: 326.93 | step: 1.66
75%|███████▌ | 2637/3507 [1:04:58<20:16, 1.40s/it]
{'loss': 0.486, 'learning_rate': 3.0591033997049646e-06, 'epoch': 0.75}
tensor([[-3.3906, 1.6953, 4.5938, -2.4219, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:49:45,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 40.33 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.5938, -5.5938, -2.4375, 1.3359, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5781, -3.4688, -0.0728, 3.5156, -1.1328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.1875, -4.2812, 1.0938, 0.2598, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4688, -3.9688, -0.9219, 1.5000, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.8125, -3.2344, -1.4766, 2.0000, -0.6328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7812, -4.5000, 0.1953, 2.0938, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2188, -4.4688, -1.0859, 3.2812, -1.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:49:46,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.22 | optimizer_step: 0.30
[2025-11-06 18:49:46,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 250.60 | bwd_microstep: 1046.84 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 1045.42 | step_microstep: 2.27
[2025-11-06 18:49:46,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.94 | bwd: 1047.76 | bwd_inner: 2.16 | bwd_allreduce: 1045.47 | step: 2.34
75%|███████▌ | 2638/3507 [1:05:00<20:07, 1.39s/it]
{'loss': 0.2151, 'learning_rate': 3.0524565629311787e-06, 'epoch': 0.75}
tensor([[-9.8125, -8.3750, -2.5156, -0.1816, -6.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9688, -0.6602, 2.6719, -0.7188, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5781, 0.0172, 3.7344, 0.2637, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5938, -3.5469, -0.1055, 1.5703, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.9062, -4.4375, 0.2432, -0.0693, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:49:46,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.35 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.7812, -4.5625, -2.3594, 2.1562, -1.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-4.2188, -4.4062, -0.9062, 3.5938, -1.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.7969, -1.1484, 1.6172, -0.1094, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:49:47,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.16 | optimizer_step: 0.20
[2025-11-06 18:49:47,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.68 | bwd_microstep: 911.73 | bwd_inner_microstep: 1.89 | bwd_allreduce_microstep: 909.73 | step_microstep: 2.04
[2025-11-06 18:49:47,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.06 | bwd: 912.56 | bwd_inner: 2.64 | bwd_allreduce: 909.77 | step: 2.13
75%|███████▌ | 2639/3507 [1:05:01<19:41, 1.36s/it]
{'loss': 1.2802, 'learning_rate': 3.0458156542874283e-06, 'epoch': 0.75}
tensor([[-2.8281, -3.7344, -2.2188, 2.0938, -0.3027]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:49:47,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.30 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.5938, -5.1562, 0.1270, 2.1719, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.0938, -6.2500, -2.2656, 2.2031, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -1.3359, 2.1250, -0.7891, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.8438, -4.0938, 1.6562, 0.6719, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.6094, 0.7656, 3.2344, -0.0752, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0625, -4.3125, -1.3672, 2.6094, -1.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3750, -3.0781, 1.1406, 2.9062, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:49:49,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:49:49,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.39 | bwd_microstep: 1651.19 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1650.08 | step_microstep: 1.92
[2025-11-06 18:49:49,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.70 | bwd: 1652.20 | bwd_inner: 1.90 | bwd_allreduce: 1650.13 | step: 2.02
75%|███████▌ | 2640/3507 [1:05:03<22:36, 1.57s/it]
{'loss': 0.4212, 'learning_rate': 3.039180679440199e-06, 'epoch': 0.75}
tensor([[-5.0625, -4.5625, -0.2695, 3.1719, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1875, -1.4609, 2.3594, 0.9961, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1250, -4.5000, -0.2471, 2.8125, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.7188, -3.3281, 0.7539, 0.1016, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:49,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.01 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-4.5000, -1.6797, 1.8672, 0.2432, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0000, -2.6562, -0.4629, 1.7734, -1.2109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.2188, -4.6250, -0.8125, 2.1250, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.9375, -4.0938, 0.3828, 1.0469, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:50,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 18:49:50,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.24 | bwd_microstep: 1.72 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.64 | step_microstep: 1.54
[2025-11-06 18:49:50,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.26 | bwd: 2.71 | bwd_inner: 1.88 | bwd_allreduce: 0.69 | step: 1.63
75%|███████▌ | 2641/3507 [1:05:03<17:41, 1.23s/it]
{'loss': 0.6584, 'learning_rate': 3.032551644050917e-06, 'epoch': 0.75}
tensor([[-4.8125, -2.9531, 1.4766, 2.3125, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.3438, 0.8477, 3.5312, -1.7500, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.4375, -5.3438, -3.7812, 0.5078, -1.5859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:49:50,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.74 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.7188, -2.9688, 2.8438, 0.0247, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.1250, -4.9688, 0.1445, 2.3438, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.8750, -5.1562, -0.1021, 3.2969, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.3750, -2.8750, 2.4531, -0.0737, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.1094, -2.6562, -0.4609, 3.8594, 0.2871]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:49:52,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 18:49:52,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.72 | bwd_microstep: 2366.69 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 2365.67 | step_microstep: 2.24
[2025-11-06 18:49:52,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.48 | bwd: 2367.64 | bwd_inner: 1.77 | bwd_allreduce: 2365.73 | step: 2.32
75%|███████▌ | 2642/3507 [1:05:06<24:30, 1.70s/it]
{'loss': 0.8318, 'learning_rate': 3.0259285537759375e-06, 'epoch': 0.75}
tensor([[-4.2188, -3.0312, 1.1094, 2.9531, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8438, -1.2969, 2.2969, -1.0156, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.8750, -5.0000, -3.1406, 2.0781, -0.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.7812, -2.0469, 3.4375, 0.1875, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:49:53,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.67 | bwd_microstep: 2.53 | bwd_inner_microstep: 2.42 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.6562, -2.5469, 1.6016, -0.5039, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.2188, -4.9375, -1.3828, 1.9609, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4688, -2.3906, 1.4531, 1.4766, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2500, -4.9375, -1.8984, 0.9453, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:53,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 18:49:53,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.05 | bwd_microstep: 2.04 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.00
[2025-11-06 18:49:53,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 477.75 | bwd: 4.57 | bwd_inner: 3.52 | bwd_allreduce: 0.90 | step: 2.09
75%|███████▌ | 2643/3507 [1:05:07<19:27, 1.35s/it]
{'loss': 0.1632, 'learning_rate': 3.0193114142665424e-06, 'epoch': 0.75}
tensor([[-2.3281, 0.3047, 1.6875, -0.7266, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:49:53,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 68.21 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.9062, -3.3281, 0.8398, 1.7578, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4062, -3.6719, -0.2617, 2.0781, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0312, -3.7656, -0.3047, 0.8125, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7031, -4.0312, -1.8438, 2.0312, -1.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7969, -4.8438, -2.7656, 2.1719, -0.9258]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.8906, -2.0781, 1.4531, 3.6406, -1.1172]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0000, -1.4688, 2.2812, 1.3359, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:49:56,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.23 | optimizer_step: 0.34
[2025-11-06 18:49:56,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.86 | bwd_microstep: 1998.57 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 1997.56 | step_microstep: 2.72
[2025-11-06 18:49:56,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 223.08 | bwd: 1999.26 | bwd_inner: 1.49 | bwd_allreduce: 1997.60 | step: 2.80
75%|███████▌ | 2644/3507 [1:05:10<26:52, 1.87s/it]
{'loss': 0.327, 'learning_rate': 3.0127002311689446e-06, 'epoch': 0.75}
tensor([[-3.9062, -1.0938, 2.6719, 0.8125, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.6953, -2.6875, -1.7422, 2.6094, 0.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3125, -4.1875, -0.2871, 1.4766, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:56,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 200.95 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-4.4375, -3.7969, 0.3867, 3.1875, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1250, -2.1406, 2.1562, 0.5352, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-7.2500, -3.3594, 2.5625, -0.8359, -6.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.0000, -5.0625, -1.5391, -1.6250, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7500, -3.0156, 1.0859, 3.7344, -1.5859]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:49:58,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.26
[2025-11-06 18:49:58,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.90 | bwd_microstep: 1094.14 | bwd_inner_microstep: 1.49 | bwd_allreduce_microstep: 1092.52 | step_microstep: 6.12
[2025-11-06 18:49:58,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.87 | bwd: 1095.21 | bwd_inner: 2.44 | bwd_allreduce: 1092.59 | step: 6.22
75%|███████▌ | 2645/3507 [1:05:11<25:13, 1.76s/it]
{'loss': 0.2035, 'learning_rate': 3.006095010124267e-06, 'epoch': 0.75}
tensor([[-4.2500, -3.2812, 1.0781, 3.5625, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.6719, 1.7422, 2.2812, -1.6875, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-4.6875, -4.5625, -0.6406, 3.1250, -1.9609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:58,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.57 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.3125, 1.7734, 3.4375, -1.9062, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.1250, -5.5000, -0.9883, 2.3125, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.1562, -1.0000, 2.9375, 0.0596, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0938, -2.8594, 1.6016, 1.4453, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0938, -5.6875, -1.5156, 1.9062, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:49:59,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.85 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:49:59,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.03 | bwd_microstep: 971.92 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 971.04 | step_microstep: 2.94
[2025-11-06 18:49:59,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.63 | bwd: 972.61 | bwd_inner: 1.38 | bwd_allreduce: 971.08 | step: 3.03
75%|███████▌ | 2646/3507 [1:05:13<23:33, 1.64s/it]
{'loss': 0.593, 'learning_rate': 2.99949575676854e-06, 'epoch': 0.75}
tensor([[-5.5938, -3.2656, 1.5938, 1.4766, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.7812, -1.6562, 1.8125, 3.6250, -1.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:49:59,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.06 | bwd_microstep: 1.24 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11
tensor([[-5.4375, -3.1719, 1.2422, 0.3164, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.9375, -4.0000, 1.1406, 1.9844, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7812, -1.0859, 2.4219, 1.0234, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9375, -2.1094, 2.1719, 0.6250, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.6562, -6.2188, -0.0311, 2.5469, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2500, -3.0000, 1.4531, 1.2031, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:50:00,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 18:50:00,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.57 | bwd_microstep: 356.03 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 354.93 | step_microstep: 1.78
[2025-11-06 18:50:00,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 406.64 | bwd: 357.27 | bwd_inner: 2.10 | bwd_allreduce: 355.00 | step: 1.89
75%|███████▌ | 2647/3507 [1:05:14<19:57, 1.39s/it]
{'loss': 0.8479, 'learning_rate': 2.9929024767327088e-06, 'epoch': 0.75}
tensor([[-4.4375, -1.0312, 2.5469, -0.7227, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:50:00,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.80 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.4688, 2.4844, 4.3438, -0.6992, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-4.5312, -2.1875, 1.6328, 0.9805, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.3438, -3.3594, 2.3438, 1.0938, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4375, -5.1562, -0.4160, 3.7188, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6719, -4.5312, -2.4531, 2.0312, -1.0078]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1562, -1.7031, 2.9531, 2.4531, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.4375, 0.3125, 3.4375, -0.3633, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:50:01,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:50:01,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.37 | bwd_microstep: 925.91 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 925.04 | step_microstep: 1.80
[2025-11-06 18:50:01,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.20 | bwd: 926.81 | bwd_inner: 1.57 | bwd_allreduce: 925.09 | step: 1.88
76%|███████▌ | 2648/3507 [1:05:15<19:22, 1.35s/it]
{'loss': 0.5075, 'learning_rate': 2.9863151756426255e-06, 'epoch': 0.76}
tensor([[-7.7188, -6.1250, -0.2188, 1.8438, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.8125, -3.1719, 2.1875, -0.4570, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.9375, -6.0625, 0.3457, 2.0312, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:50:01,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.99 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.7188, -1.4219, 3.6250, -1.0781, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1875, -2.2344, 1.3750, 1.6328, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.1875, -5.2812, 0.2217, 1.3750, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.8594, -3.2188, -0.2031, 1.9219, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.9688, -3.0156, -1.1406, 1.7109, -0.9180]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
[2025-11-06 18:50:03,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.45 | optimizer_step: 0.42
[2025-11-06 18:50:03,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.07 | bwd_microstep: 1911.78 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 1910.64 | step_microstep: 3.51
[2025-11-06 18:50:03,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 452.09 | bwd: 1912.63 | bwd_inner: 1.62 | bwd_allreduce: 1910.76 | step: 3.59
76%|███████▌ | 2649/3507 [1:05:17<23:59, 1.68s/it]
{'loss': 0.5196, 'learning_rate': 2.9797338591190362e-06, 'epoch': 0.76}
tensor([[-4.6875, -4.7500, -0.9727, 3.1406, -1.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4688, -2.2500, 1.0625, -1.4453,
-4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0312, -4.5938, -0.4141, 2.8438, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:04,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.35 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.0938, -3.1562, 1.6016, 0.2793, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4375, -3.1875, 0.1816, 1.3125, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3438, -5.6875, -3.0469, 1.0703, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5938, 1.1094, 3.9688, -0.1123, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0000, -2.9219, 1.4219, -0.8281, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:50:05,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.24 | optimizer_step: 0.32 [2025-11-06 18:50:05,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.16 | bwd_microstep: 1479.57 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 1478.67 | step_microstep: 2.49 [2025-11-06 18:50:05,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.53 | bwd: 1480.29 | bwd_inner: 1.41 | bwd_allreduce: 1478.73 | step: 2.57 76%|███████▌ | 2650/3507 [1:05:19<24:55, 1.75s/it] {'loss': 0.2626, 'learning_rate': 2.9731585327775814e-06, 'epoch': 0.76} 76%|███████▌ | 2650/3507 [1:05:19<24:55, 1.75s/it]tensor([[-2.8750, 0.1748, 1.3672, -1.3906, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:50:05,975] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 146.33 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.5312, -2.5625, 2.8125, 3.5625, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1719, 1.4453, 3.5000, -2.7812, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.7500, -3.5469, -0.7539, 0.2412, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6875, -4.2188, -1.7969, 2.5312, -0.9883]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9844, 1.9922, 2.9375, -2.6406, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-2.3594, -0.1924, 2.3906, 1.7422, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4375, -2.4531, 3.1562, -0.5703, -5.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:50:06,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:50:06,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.38 | bwd_microstep: 677.81 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 676.64 | step_microstep: 1.67 [2025-11-06 18:50:06,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 282.72 | bwd: 678.76 | bwd_inner: 1.92 | bwd_allreduce: 676.69 | step: 1.77 76%|███████▌ | 2651/3507 [1:05:20<21:41, 1.52s/it] {'loss': 0.3817, 'learning_rate': 2.966589202228781e-06, 'epoch': 0.76} 76%|███████▌ | 2651/3507 [1:05:20<21:41, 1.52s/it]tensor([[-2.8125, -3.1406, -0.2041, 4.0312, -0.3535]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9375, -1.8203, 0.9141, 3.8281, -0.1128]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -2.4531, 1.6406, 1.0000, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9219, 0.5781, 3.1250, -0.6445, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4062, -4.7500, -0.4434, 2.4844, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0625, -4.0000, 1.5234, 2.3594, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5625, -0.7812, 3.3750, 2.1875, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:50:07,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.34 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.4375, -2.7812, 0.8906, 3.3906, -1.3984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:08,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.21 | optimizer_step: 0.21 [2025-11-06 18:50:08,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 229.35 | bwd_microstep: 2.14 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1.06 | step_microstep: 1.97 [2025-11-06 18:50:08,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 414.63 | bwd: 3.02 | bwd_inner: 1.74 | bwd_allreduce: 1.10 | step: 2.06 76%|███████▌ | 2652/3507 [1:05:22<21:17, 1.49s/it] {'loss': 0.5428, 'learning_rate': 2.9600258730780564e-06, 'epoch': 0.76} 76%|███████▌ | 2652/3507 [1:05:22<21:17, 1.49s/it]tensor([[-5.0625, -1.8750, 2.1250, -0.1523, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:08,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.37 | bwd_microstep: 1.96 | 
bwd_inner_microstep: 1.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.6562, -0.5391, 2.9062, 0.2695, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7500, -3.0156, 1.6484, 0.7969, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-8.1875, -7.7500, -2.4062, 1.5156, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5312, -4.3750, 0.2930, 2.5312, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8438, -1.7500, 1.3750, 0.8203, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -4.5625, -0.9062, 2.4219, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0312, -0.2656, 3.7344, -2.2500, -5.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 18:50:09,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:50:09,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.11 | bwd_microstep: 1252.60 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 1251.56 | step_microstep: 1.98 [2025-11-06 18:50:09,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.50 | bwd: 1254.57 | bwd_inner: 2.81 | bwd_allreduce: 1251.61 | step: 2.06 76%|███████▌ | 2653/3507 [1:05:23<21:50, 1.53s/it] {'loss': 1.1203, 'learning_rate': 2.9534685509256954e-06, 'epoch': 0.76} 76%|███████▌ | 2653/3507 [1:05:23<21:50, 1.53s/it]tensor([[-5.4375, -1.6562, 2.2344, -1.7344, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.0938, -3.2500, 0.7500, 3.2344, -1.8984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5156, 
0.3184, 2.4219, -1.8672, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.9062, -3.5781, 1.0234, 2.9375, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0312, -4.0938, 1.4062, 2.1562, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0312, -0.4668, 1.9922, 0.3613, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -4.5938, -1.1328, 2.4062, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:11,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.68 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.3438, -2.2188, 3.2969, -0.7227, -5.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:50:12,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:50:12,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 335.13 | bwd_microstep: 2.28 | bwd_inner_microstep: 1.37 | bwd_allreduce_microstep: 0.82 | step_microstep: 3.09 [2025-11-06 18:50:12,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 509.84 | bwd: 3.19 | bwd_inner: 2.20 | bwd_allreduce: 0.86 | step: 3.18 76%|███████▌ | 2654/3507 [1:05:25<24:27, 1.72s/it] {'loss': 0.8896, 'learning_rate': 2.9469172413668647e-06, 'epoch': 0.76} 76%|███████▌ | 2654/3507 [1:05:25<24:27, 1.72s/it]tensor([[-5.4062, -3.0312, 1.5234, 0.7227, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-9.1250, -6.3125, -0.3164, -0.9453, -6.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:50:12,222] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.38 | bwd_microstep: 4.72 | bwd_inner_microstep: 4.59 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.1562, -3.2969, 1.4219, 0.1367, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1250, -1.0938, 2.5938, -1.8438, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3438, -3.1562, 1.1797, 1.1641, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, -1.9766, 1.7109, 0.5977, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2188, 0.1699, 1.9766, -1.5078, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8750, -2.7812, 1.8125, 1.8828, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:50:12,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:50:12,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.74 | bwd_microstep: 194.42 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 193.27 | step_microstep: 2.04 [2025-11-06 18:50:12,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.15 | bwd: 199.15 | bwd_inner: 5.69 | bwd_allreduce: 193.32 | step: 2.14 76%|███████▌ | 2655/3507 [1:05:26<19:41, 1.39s/it] {'loss': 0.4786, 'learning_rate': 2.9403719499916008e-06, 'epoch': 0.76} 76%|███████▌ | 2655/3507 [1:05:26<19:41, 1.39s/it]tensor([[-6.0312, -4.7812, -0.8477, 0.6484, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3750, -3.8906, -0.9883, 1.7109, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9062, -0.3477, 2.6562, -1.1797, -3.9375]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0312, -3.1719, 0.0564, 2.0938, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -2.4688, 1.4453, 1.0625, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, -3.1094, 0.8438, 2.0938, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2812, -4.5312, -1.4609, 2.6094, -1.5547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:14,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 122.59 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.0625, -4.7500, -0.8398, 2.7812, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:14,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.78 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 18:50:14,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.65 | bwd_microstep: 1.92 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.92 | step_microstep: 2.78 [2025-11-06 18:50:14,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 298.25 | bwd: 2.80 | bwd_inner: 1.67 | bwd_allreduce: 0.96 | step: 2.87 76%|███████▌ | 2656/3507 [1:05:28<21:24, 1.51s/it] {'loss': 0.3299, 'learning_rate': 2.933832682384802e-06, 'epoch': 0.76} 76%|███████▌ | 2656/3507 [1:05:28<21:24, 1.51s/it]tensor([[-5.5625e+00, -4.4375e+00, -2.4109e-03, 2.2500e+00, -3.1562e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8750, -0.7109, 1.4453, -0.1484, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9844, -2.1875, 1.4062, 1.8047, -2.5469]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:50:14,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.38 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-3.5469, -4.3438, -2.2344, 2.2969, -0.7734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0000, -4.1250, 0.2305, 2.8906, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9688, -2.7812, 1.9141, -0.3926, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.3125, -4.2812, 0.6172, 1.2734, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -3.8594, -0.4199, 0.5781, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:50:15,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.19 | optimizer_step: 0.21 [2025-11-06 18:50:15,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.94 | bwd_microstep: 784.04 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 782.95 | step_microstep: 1.89 [2025-11-06 18:50:15,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 419.34 | bwd: 785.01 | bwd_inner: 1.84 | bwd_allreduce: 783.01 | step: 1.99 76%|███████▌ | 2657/3507 [1:05:29<20:16, 1.43s/it] {'loss': 0.8985, 'learning_rate': 2.927299444126229e-06, 'epoch': 0.76} 76%|███████▌ | 2657/3507 [1:05:29<20:16, 1.43s/it]tensor([[-3.7812, -4.2812, -1.4297, 3.1562, -0.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.1406, -2.8438, -2.1406, 1.3984, -0.0243]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -2.6406, 1.7812, 1.5781, -3.5312]], device='cuda:2', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:2') tensor([[-6.4375, -2.5156, 2.7188, -0.8086, -5.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0625, -1.3750, 3.4688, -0.1826, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5625, -5.2188, -0.4883, 1.2656, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6875, -5.3750, -1.9531, 1.4922, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:17,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.94 | bwd_microstep: 1.21 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 tensor([[-3.0000, -2.4219, 1.3828, 4.1875, -0.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:17,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.23 | optimizer_step: 0.29 [2025-11-06 18:50:17,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.07 | bwd_microstep: 2.80 | bwd_inner_microstep: 1.34 | bwd_allreduce_microstep: 1.35 | step_microstep: 2.37 [2025-11-06 18:50:17,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.03 | bwd: 4.02 | bwd_inner: 2.31 | bwd_allreduce: 1.44 | step: 2.49 76%|███████▌ | 2658/3507 [1:05:31<21:17, 1.50s/it] {'loss': 0.1818, 'learning_rate': 2.9207722407905004e-06, 'epoch': 0.76} 76%|███████▌ | 2658/3507 [1:05:31<21:17, 1.50s/it]tensor([[-3.6406, -1.0078, 2.2812, 0.8711, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0625, -3.4219, 0.3574, 2.9844, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6562, -3.1094, 1.2500, 2.7188, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:3') tensor([[-3.6562, 0.5000, 4.0312, -0.5664, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:50:17,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.34 | bwd_microstep: 1.18 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.14 tensor([[-4.7500, -2.9062, 1.2891, 1.7734, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.4375, 1.6406, 3.6562, -1.5391, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([[-5.0312, -3.4219, -0.1357, 0.7656, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([3], device='cuda:3') tensor([[-4.9688, -2.4375, 1.0391, -0.2988, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:17,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.16 | optimizer_step: 0.21 [2025-11-06 18:50:17,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.52 | bwd_microstep: 1.68 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.79 | step_microstep: 1.72 [2025-11-06 18:50:17,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 425.89 | bwd: 2.85 | bwd_inner: 1.82 | bwd_allreduce: 0.85 | step: 1.85 76%|███████▌ | 2659/3507 [1:05:31<16:54, 1.20s/it] {'loss': 0.4581, 'learning_rate': 2.914251077947077e-06, 'epoch': 0.76} 76%|███████▌ | 2659/3507 [1:05:31<16:54, 1.20s/it]tensor([[-5.1562, -5.3750, -1.7266, 2.7656, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0156, 1.6797, 1.6094, -0.8516, -1.3984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.7500, -4.3438, 0.6523, 2.4375, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7812, 
-2.4531, 1.8281, 1.2109, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3125, -0.5391, 2.0781, -0.4902, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.3438, -5.8125, -0.2256, 1.4766, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:20,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 320.25 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.0000, -3.2031, -0.2432, 3.6250, -0.5859]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.6875e+00, -4.2812e+00, 4.5312e-01, 3.6621e-03, -4.9688e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:20,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 18:50:20,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.49 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.44 [2025-11-06 18:50:20,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 501.78 | bwd: 2.84 | bwd_inner: 1.76 | bwd_allreduce: 0.93 | step: 2.55 76%|███████▌ | 2660/3507 [1:05:34<23:20, 1.65s/it] {'loss': 0.6572, 'learning_rate': 2.9077359611602773e-06, 'epoch': 0.76} 76%|███████▌ | 2660/3507 [1:05:34<23:20, 1.65s/it]tensor([[-4.5938, -3.8438, 0.1787, 2.8594, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:50:20,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.46 | bwd_microstep: 1.27 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.23 tensor([[-6.0938, -4.5625, -0.3809, 0.7227, -4.0312]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:3') tensor([[-7.8125, -4.6250, 1.4844, -0.0157, -6.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.5078, -2.2969, -1.8984, 1.3516, 0.3887]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0938, -3.8281, 0.3574, 1.8047, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.0391, 2.7500, 5.3125, 0.8359, -1.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -3.8594, 0.4004, 2.6875, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6250, -3.4844, 1.5625, 1.8281, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:50:22,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:50:22,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.64 | bwd_microstep: 1078.47 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 1077.23 | step_microstep: 1.89 [2025-11-06 18:50:22,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.12 | bwd: 1079.74 | bwd_inner: 2.26 | bwd_allreduce: 1077.30 | step: 2.13 76%|███████▌ | 2661/3507 [1:05:35<22:28, 1.59s/it] {'loss': 1.14, 'learning_rate': 2.9012268959892562e-06, 'epoch': 0.76} 76%|███████▌ | 2661/3507 [1:05:35<22:28, 1.59s/it]tensor([[-3.8750, -1.3984, 2.5781, 1.5078, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8750, -1.7109, 1.9141, 3.6562, -1.2109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0938, -3.9062, -1.9844, 2.2500, -0.5742]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-5.7188, -3.5312, 1.2031, 1.3281, -4.0000]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.1562, -5.8125, -0.0195, 2.1250, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0625, -0.0742, 2.4531, -2.6094, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4688, -4.6250, -0.3477, 2.2031, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:22,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.88 | bwd_microstep: 1.12 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.5000, -2.0625, 2.3906, 4.0312, -1.6328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:22,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 18:50:22,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.42 | bwd_microstep: 1.69 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.51 [2025-11-06 18:50:22,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.30 | bwd: 2.81 | bwd_inner: 1.81 | bwd_allreduce: 0.85 | step: 2.60 76%|███████▌ | 2662/3507 [1:05:36<18:52, 1.34s/it] {'loss': 0.6955, 'learning_rate': 2.894723887987997e-06, 'epoch': 0.76} 76%|███████▌ | 2662/3507 [1:05:36<18:52, 1.34s/it]tensor([[-5.7812, -6.4375, -3.8594, 0.5625, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-5.5312, -3.5312, 0.5742, 1.1016, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:50:22,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.67 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.2188, 
-4.6875, -0.7031, 2.4688, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[... per-micro-batch debug prints elided: one pair of tensors per rank (cuda:0-cuda:3) per micro-batch, each a 1x5 bfloat16 logit tensor with its integer label tensor; per-microstep and optimizer timing lines collapsed to the per-step totals below ...]
[2025-11-06 18:50:25,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.03 | bwd: 1863.56 | bwd_inner: 1.68 | bwd_allreduce: 1861.73 | step: 2.14
 76%|███████▌ | 2663/3507 [1:05:38<22:44, 1.62s/it] {'loss': 0.9261, 'learning_rate': 2.888226942705319e-06, 'epoch': 0.76}
[2025-11-06 18:50:26,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 404.16 | bwd: 3.34 | bwd_inner: 2.06 | bwd_allreduce: 1.12 | step: 2.09
 76%|███████▌ | 2664/3507 [1:05:40<22:54, 1.63s/it] {'loss': 0.0841, 'learning_rate': 2.881736065684878e-06, 'epoch': 0.76}
[2025-11-06 18:50:27,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.22 | bwd: 514.44 | bwd_inner: 1.34 | bwd_allreduce: 512.94 | step: 2.55
 76%|███████▌ | 2665/3507 [1:05:41<19:52, 1.42s/it] {'loss': 0.3344, 'learning_rate': 2.875251262465142e-06, 'epoch': 0.76}
[2025-11-06 18:50:28,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.83 | bwd: 3.78 | bwd_inner: 2.81 | bwd_allreduce: 0.84 | step: 1.86
 76%|███████▌ | 2666/3507 [1:05:42<18:01, 1.29s/it] {'loss': 0.788, 'learning_rate': 2.8687725385793973e-06, 'epoch': 0.76}
[2025-11-06 18:50:31,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.05 | bwd: 2387.38 | bwd_inner: 1.84 | bwd_allreduce: 2385.40 | step: 2.28
 76%|███████▌ | 2667/3507 [1:05:45<24:24, 1.74s/it] {'loss': 0.3933, 'learning_rate': 2.862299899555746e-06, 'epoch': 0.76}
[2025-11-06 18:50:31,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.99 | bwd: 2.98 | bwd_inner: 1.91 | bwd_allreduce: 0.93 | step: 1.52
 76%|███████▌ | 2668/3507 [1:05:45<18:48, 1.35s/it] {'loss': 0.443, 'learning_rate': 2.8558333509170943e-06, 'epoch': 0.76}
[2025-11-06 18:50:32,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.36 | bwd: 75.89 | bwd_inner: 2.12 | bwd_allreduce: 73.64 | step: 1.62
 76%|███████▌ | 2669/3507 [1:05:46<15:03, 1.08s/it] {'loss': 0.2674, 'learning_rate': 2.8493728981811553e-06, 'epoch': 0.76}
[2025-11-06 18:50:33,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.82 | bwd: 2.89 | bwd_inner: 1.76 | bwd_allreduce: 0.95 | step: 2.79
 76%|███████▌ | 2670/3507 [1:05:47<17:49, 1.28s/it] {'loss': 0.3281, 'learning_rate': 2.842918546860438e-06, 'epoch': 0.76}
[2025-11-06 18:50:35,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.92 | bwd: 2.36 | bwd_inner: 1.41 | bwd_allreduce: 0.81 | step: 2.11
 76%|███████▌ | 2671/3507 [1:05:49<19:25, 1.39s/it] {'loss': 0.6084, 'learning_rate': 2.8364703024622474e-06, 'epoch': 0.76}
[2025-11-06 18:50:37,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 435.07 | bwd: 2.33 | bwd_inner: 1.36 | bwd_allreduce: 0.83 | step: 2.89
 76%|███████▌ | 2672/3507 [1:05:50<19:49, 1.42s/it] {'loss': 0.0771, 'learning_rate': 2.8300281704886778e-06, 'epoch': 0.76}
[2025-11-06 18:50:39,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.06 | bwd: 7.94 | bwd_inner: 6.81 | bwd_allreduce: 0.95 | step: 2.70
 76%|███████▌ | 2673/3507 [1:05:53<25:36, 1.84s/it] {'loss': 0.4822, 'learning_rate': 2.8235921564366043e-06, 'epoch': 0.76}
[2025-11-06 18:50:40,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.77 | bwd: 27.55 | bwd_inner: 6.11 | bwd_allreduce: 21.26 | step: 3.96
 76%|███████▌ | 2674/3507 [1:05:54<19:44, 1.42s/it] {'loss': 0.414, 'learning_rate': 2.817162265797685e-06, 'epoch': 0.76}
[2025-11-06 18:50:43,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.40 | bwd: 2.77 | bwd_inner: 1.73 | bwd_allreduce: 0.88 | step: 3.53
 76%|███████▋ | 2675/3507 [1:05:57<27:38, 1.99s/it] {'loss': 0.7023, 'learning_rate': 2.81073850405835e-06, 'epoch': 0.76}
[2025-11-06 18:50:44,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.18 | bwd: 189.25 | bwd_inner: 1.54 | bwd_allreduce: 187.56 | step: 2.91
 76%|███████▋ | 2676/3507 [1:05:58<21:39, 1.56s/it] {'loss': 0.2086, 'learning_rate': 2.8043208766998088e-06, 'epoch': 0.76}
[2025-11-06 18:50:46,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.79 | bwd: 2.49 | bwd_inner: 1.48 | bwd_allreduce: 0.86 | step: 2.39
 76%|███████▋ | 2677/3507 [1:06:00<25:45, 1.86s/it] {'loss': 0.5554, 'learning_rate': 2.7979093891980257e-06, 'epoch': 0.76}
[2025-11-06 18:50:47,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 430.04 | bwd: 2.46 | bwd_inner: 1.58 | bwd_allreduce: 0.73 | step: 3.43
 76%|███████▋ | 2678/3507 [1:06:01<19:59, 1.45s/it] {'loss': 0.6256, 'learning_rate': 2.791504047023734e-06, 'epoch': 0.76}
[2025-11-06 18:50:49,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.31 | bwd: 2.83 | bwd_inner: 1.78 | bwd_allreduce: 0.90 | step: 2.73
 76%|███████▋ | 2679/3507 [1:06:03<23:48, 1.73s/it] {'loss': 0.2584, 'learning_rate': 2.78510485564241e-06, 'epoch': 0.76}
[2025-11-06 18:50:51,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 410.41 | bwd: 528.96 | bwd_inner: 1.68 | bwd_allreduce: 527.11 | step: 1.94
 76%|███████▋ | 2680/3507 [1:06:05<23:37, 1.71s/it] {'loss': 0.8273, 'learning_rate': 2.7787118205143005e-06, 'epoch': 0.76}
[2025-11-06 18:50:51,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.52 | bwd: 2.38 | bwd_inner: 1.45 | bwd_allreduce: 0.79 | step: 1.47
 76%|███████▋ | 2681/3507 [1:06:05<18:06, 1.32s/it] {'loss': 0.3443, 'learning_rate': 2.772324947094388e-06, 'epoch': 0.76}
[2025-11-06 18:50:53,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.89 | bwd: 562.25 | bwd_inner: 2.28 | bwd_allreduce: 559.77 | step: 2.62
 76%|███████▋ | 2682/3507 [1:06:07<19:18, 1.40s/it] {'loss': 0.3268, 'learning_rate': 2.7659442408324e-06, 'epoch': 0.76}
tensor([[-5.1875, -2.7031, 1.6953,
0.8828, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -4.2188, -0.2412, 2.4062, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:50:54,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.28 | optimizer_step: 0.26 [2025-11-06 18:50:54,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.87 | bwd_microstep: 934.02 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 932.89 | step_microstep: 2.35 [2025-11-06 18:50:54,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.69 | bwd: 934.91 | bwd_inner: 1.77 | bwd_allreduce: 932.95 | step: 2.45 77%|███████▋ | 2683/3507 [1:06:08<18:58, 1.38s/it] {'loss': 0.4404, 'learning_rate': 2.759569707172799e-06, 'epoch': 0.77} 77%|███████▋ | 2683/3507 [1:06:08<18:58, 1.38s/it]tensor([[-7.0938, -5.2500, -0.3730, 0.5625, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5625, -5.5625, -1.6641, 2.4062, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.7188, -3.9844, 1.9062, 1.4062, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1562, -3.6094, 0.4492, -0.6758, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.0781, 0.5703, 1.3359, -1.2266, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4375, -2.4062, 1.7031, -0.4512, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4688, -3.8438, -0.0708, 2.5000, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:56,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.27 | bwd_microstep: 1.17 | 
bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.1875, -4.0000, 0.7773, 2.8906, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:56,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:50:56,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.36 | bwd_microstep: 1.93 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 2.29 [2025-11-06 18:50:56,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.61 | bwd: 3.10 | bwd_inner: 2.19 | bwd_allreduce: 0.78 | step: 2.36 77%|███████▋ | 2684/3507 [1:06:10<20:38, 1.50s/it] {'loss': 0.3048, 'learning_rate': 2.7532013515547863e-06, 'epoch': 0.77} 77%|███████▋ | 2684/3507 [1:06:10<20:38, 1.50s/it]tensor([[-5.8125, -1.8516, 2.3125, -2.0312, -5.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9531, -0.1621, 1.8984, -2.4375, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:50:56,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 283.45 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.5938, -3.3125, 0.0747, 3.4062, -1.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7188, -4.7500, -1.2109, 2.7031, -1.9609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2344, -0.3340, 0.8555, -1.5938, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.3750, 0.3242, 2.3125, 0.4004, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.7812, -5.8438, -1.1406, 1.4766, -4.0312]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.1016, 2.4688, 1.4922, -1.3984, -0.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 18:50:57,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:50:57,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.05 | bwd_microstep: 829.84 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 828.68 | step_microstep: 1.88 [2025-11-06 18:50:57,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 480.52 | bwd: 830.86 | bwd_inner: 1.99 | bwd_allreduce: 828.73 | step: 1.97 77%|███████▋ | 2685/3507 [1:06:11<19:59, 1.46s/it] {'loss': 0.476, 'learning_rate': 2.746839179412286e-06, 'epoch': 0.77} 77%|███████▋ | 2685/3507 [1:06:11<19:59, 1.46s/it]tensor([[-4.8438, -4.1562, 0.0825, 3.1094, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.2812, -5.0312, 0.7188, 0.8203, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.0938, -5.0312, 0.9180, 1.7969, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -1.3125, 2.6562, 1.4688, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1562, -3.5156, 0.6172, 1.5234, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:58,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.71 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 tensor([[-5.9062, -2.0156, 3.2188, -0.2969, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5312, -1.7578, 2.7031, -1.0000, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:0') tensor([[-5.3125, -2.5625, 1.2500, -0.1445, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:50:58,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:50:58,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.76 | bwd_microstep: 127.24 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 125.89 | step_microstep: 1.65 [2025-11-06 18:50:58,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.50 | bwd: 128.32 | bwd_inner: 2.16 | bwd_allreduce: 125.94 | step: 1.77 77%|███████▋ | 2686/3507 [1:06:12<17:41, 1.29s/it] {'loss': 0.8498, 'learning_rate': 2.7404831961739487e-06, 'epoch': 0.77} 77%|███████▋ | 2686/3507 [1:06:12<17:41, 1.29s/it]tensor([[-4.6562, -2.9531, 1.0156, 1.6016, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.8438, -6.1250, -1.8594, 0.9766, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.7500, -4.8750, 0.8125, 1.8750, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:50:59,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 278.23 | bwd_microstep: 1.17 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 tensor([[-2.6094, 1.5000, 3.3125, -1.7109, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9219, -1.1953, 2.3594, 0.7734, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.9688, -3.5156, 2.3750, 0.0874, -5.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.0469, 1.7500, 2.2344, -2.8438, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') 
tensor([[-4.6250, -0.4668, 3.5781, -0.7734, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:51:00,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.78 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:51:00,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.73 | bwd_microstep: 1119.76 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 1118.56 | step_microstep: 2.87 [2025-11-06 18:51:00,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 435.97 | bwd: 1120.93 | bwd_inner: 2.03 | bwd_allreduce: 1118.66 | step: 3.00 77%|███████▋ | 2687/3507 [1:06:14<19:29, 1.43s/it] {'loss': 0.2882, 'learning_rate': 2.7341334072631456e-06, 'epoch': 0.77} 77%|███████▋ | 2687/3507 [1:06:14<19:29, 1.43s/it]tensor([[-0.5312, 2.3125, 4.5938, 2.0781, -0.8242]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7812, -3.4531, 0.4238, 1.6875, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.6250, -5.0625, 1.2812, 1.4844, -5.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.4551, 3.4531, 3.3906, -2.2656, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.5312, -4.0000, 0.4883, 1.8672, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.0234, 2.2656, 2.4375, -1.9688, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3438, -3.7969, 0.3535, 1.4453, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:01,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.92 | bwd_microstep: 1.22 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 tensor([[-4.5000, -1.6484, 
2.2344, 0.4492, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:51:01,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 18:51:01,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.06 | bwd_microstep: 1.92 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.93 | step_microstep: 2.19 [2025-11-06 18:51:01,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.00 | bwd: 3.14 | bwd_inner: 1.95 | bwd_allreduce: 0.99 | step: 2.31 77%|███████▋ | 2688/3507 [1:06:15<18:58, 1.39s/it] {'loss': 0.3826, 'learning_rate': 2.7277898180979544e-06, 'epoch': 0.77} 77%|███████▋ | 2688/3507 [1:06:15<18:58, 1.39s/it]tensor([[-4.4375, -2.9688, 0.1611, 0.5312, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:51:01,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.91 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5625, -3.2656, -0.6328, 2.4688, -1.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1562, -5.3750, -0.6953, 2.4219, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8750, 1.6719, 3.9219, -1.8828, -3.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2500, -5.7188, -2.5312, 2.1875, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1250, -5.4375, -1.0938, 2.0469, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, 0.0305, 4.0000, -2.6719, -5.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4375, -5.1875, -1.2188, 2.4688, -2.5938]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:51:04,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.25 | optimizer_step: 0.32 [2025-11-06 18:51:04,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.54 | bwd_microstep: 541.67 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 540.43 | step_microstep: 255.63 [2025-11-06 18:51:04,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.45 | bwd: 542.56 | bwd_inner: 1.94 | bwd_allreduce: 540.48 | step: 255.71 77%|███████▋ | 2689/3507 [1:06:17<22:11, 1.63s/it] {'loss': 0.1598, 'learning_rate': 2.721452434091182e-06, 'epoch': 0.77} 77%|███████▋ | 2689/3507 [1:06:17<22:11, 1.63s/it]tensor([[-4.9688, -2.1875, 2.1875, 0.9023, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9688, -3.9844, 0.0391, 2.2812, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9062, -4.9375, -0.8945, 3.5312, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [h264 @ 0xd0e7300] mmco: unref short failure [h264 @ 0xd0e7300] mmco: unref short failure tensor([[-2.7188, -2.9531, -0.6836, 2.8594, -0.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.4375, -4.5312, 0.2969, 0.8633, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8125, -2.3125, 1.0000, -2.5469, -5.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5312, -3.2188, 1.6406, 3.5469, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:51:05,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.67 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | 
step_microstep: 0.08 tensor([[-4.8438, -3.1406, 0.9570, 1.6406, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:05,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 18:51:05,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.13 | bwd_microstep: 1.92 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.58 [2025-11-06 18:51:05,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 417.84 | bwd: 2.63 | bwd_inner: 1.56 | bwd_allreduce: 0.93 | step: 2.66 77%|███████▋ | 2690/3507 [1:06:19<22:49, 1.68s/it] {'loss': 0.4298, 'learning_rate': 2.7151212606503164e-06, 'epoch': 0.77} 77%|███████▋ | 2690/3507 [1:06:19<22:49, 1.68s/it]tensor([[-5.9375, -3.2656, 1.2031, -0.1094, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9375, -0.7617, 2.9219, 0.1797, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:51:05,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.68 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11 tensor([[-4.9062, -3.7500, -0.1406, 1.4219, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4688, -2.3125, 1.2578, 1.0000, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3750, 0.6484, 2.5469, -2.2812, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.6719, -0.6797, 2.5625, 0.4043, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9375, -4.9688, -0.5039, 1.9219, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, 
-5.1250, -1.1953, 2.0469, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:06,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.32 | optimizer_step: 0.34 [2025-11-06 18:51:06,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.63 | bwd_microstep: 4.46 | bwd_inner_microstep: 3.12 | bwd_allreduce_microstep: 1.15 | step_microstep: 3.29 [2025-11-06 18:51:06,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.42 | bwd: 5.18 | bwd_inner: 3.74 | bwd_allreduce: 1.20 | step: 3.40 77%|███████▋ | 2691/3507 [1:06:20<18:09, 1.33s/it] {'loss': 0.5985, 'learning_rate': 2.7087963031775576e-06, 'epoch': 0.77} 77%|███████▋ | 2691/3507 [1:06:20<18:09, 1.33s/it]tensor([[-2.0469, 1.5156, 3.1562, -1.5703, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.1875, -5.1562, -0.4336, 0.0508, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3438, -2.8438, 0.3535, 1.2969, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.5312, -6.3125, -1.1250, 1.1016, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.7500, -3.9688, 1.1953, 0.5469, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1875, -4.5938, -0.5352, 2.4688, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2812, -3.8750, -1.2266, 3.3125, -0.5898]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:09,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.50 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.08 tensor([[-6.5938, -4.5938, -0.0255, 0.5234, -4.5312]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:09,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:51:09,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.39 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.76 | step_microstep: 2.52 [2025-11-06 18:51:09,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.89 | bwd: 2.67 | bwd_inner: 1.77 | bwd_allreduce: 0.78 | step: 2.59 77%|███████▋ | 2692/3507 [1:06:23<24:52, 1.83s/it] {'loss': 0.407, 'learning_rate': 2.702477567069809e-06, 'epoch': 0.77} 77%|███████▋ | 2692/3507 [1:06:23<24:52, 1.83s/it]tensor([[-5.0938, -3.8750, 0.1719, 1.8906, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5938, -1.6641, 1.7812, -0.2051, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:51:09,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.53 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.0312, -4.3438, 0.1465, 1.2969, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8438, -2.3906, 2.4219, 2.1562, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.1875, -4.0312, 1.3516, -0.1797, -5.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.0625, -3.8906, 1.4766, 1.8516, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1719, 1.5000, 3.6406, -2.3281, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, -2.5938, 1.8438, 0.7930, -3.9062]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:51:10,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.71 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:51:10,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.47 | bwd_microstep: 730.05 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 728.92 | step_microstep: 2.37 [2025-11-06 18:51:10,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.03 | bwd: 731.06 | bwd_inner: 1.98 | bwd_allreduce: 728.96 | step: 2.45 77%|███████▋ | 2693/3507 [1:06:24<22:01, 1.62s/it] {'loss': 0.5407, 'learning_rate': 2.696165057718655e-06, 'epoch': 0.77} 77%|███████▋ | 2693/3507 [1:06:24<22:01, 1.62s/it]tensor([[-5.5000, -1.3984, 3.2500, -1.2422, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.9258, 3.4062, 3.4844, 0.9414, 0.1865]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8438, 0.1807, 2.6562, -0.1250, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4688, -3.2969, 1.6094, 1.7500, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0938, -5.0938, 0.0286, 0.6016, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -2.1406, 2.6094, 0.6875, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -0.7383, 2.4062, -0.8086, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:51:12,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.35 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.9375, -4.2812, -0.8984, 1.3828, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:0') [2025-11-06 18:51:12,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.19 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:51:12,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.03 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.86 | step_microstep: 3.33 [2025-11-06 18:51:12,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.39 | bwd: 3.03 | bwd_inner: 2.01 | bwd_allreduce: 0.89 | step: 3.42 77%|███████▋ | 2694/3507 [1:06:26<22:59, 1.70s/it] {'loss': 0.3108, 'learning_rate': 2.6898587805103715e-06, 'epoch': 0.77} 77%|███████▋ | 2694/3507 [1:06:26<22:59, 1.70s/it]tensor([[-5.6562, -2.8750, 1.5703, 0.4219, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:51:12,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 76.78 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.7500, -1.6406, 2.5156, -1.9453, -5.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4375, -4.5000, 0.0767, 2.9688, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7656, 0.6914, 2.4844, -1.4609, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2500, -2.5469, -0.1895, 2.2656, -1.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5625, -3.3594, 2.0781, -0.0223, -5.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -4.3125, -0.9727, 1.7734, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3750, -4.7188, -1.2422, 3.2812, -1.4766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 
18:51:14,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.20 | optimizer_step: 0.22 [2025-11-06 18:51:14,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.97 | bwd_microstep: 1423.61 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1422.57 | step_microstep: 2.22 [2025-11-06 18:51:14,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 257.77 | bwd: 1424.48 | bwd_inner: 1.74 | bwd_allreduce: 1422.61 | step: 2.29 77%|███████▋ | 2695/3507 [1:06:27<23:02, 1.70s/it] {'loss': 0.4546, 'learning_rate': 2.683558740825908e-06, 'epoch': 0.77} 77%|███████▋ | 2695/3507 [1:06:27<23:02, 1.70s/it]tensor([[-4.0625, -0.9336, 2.2500, -0.1064, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -5.0938, -1.1484, 2.0938, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.3750, -5.5938, -1.4297, 1.3203, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6094, -4.0938, -1.6172, 2.6250, -0.9570]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.3750, -5.8750, -1.6562, 1.3203, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3125, -0.9102, 3.2812, 0.1973, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8750, -3.7812, 0.2188, 2.1562, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:15,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.40 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-5.5625, -4.2500, 0.1602, 2.2344, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:15,567] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 18:51:15,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.48 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.67 [2025-11-06 18:51:15,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.90 | bwd: 3.11 | bwd_inner: 2.05 | bwd_allreduce: 0.90 | step: 2.77 77%|███████▋ | 2696/3507 [1:06:29<22:16, 1.65s/it] {'loss': 0.7508, 'learning_rate': 2.6772649440409084e-06, 'epoch': 0.77} 77%|███████▋ | 2696/3507 [1:06:29<22:16, 1.65s/it]tensor([[-0.6602, 2.8125, 2.5312, -2.2812, -1.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1875, -0.7773, 2.5156, -0.1875, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:51:15,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.92 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.3750, 0.1562, 3.5781, 0.0298, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -4.1250, -0.3164, 2.3281, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -4.4688, -1.2031, 2.6875, -1.6484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.4062, -1.1562, 1.5625, 0.5938, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3750, -2.4688, 1.4219, 1.4297, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5312, -3.9219, -0.3301, 2.3750, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:51:17,459] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.16 | optimizer_step: 0.20
[2025-11-06 18:51:17,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.22 | bwd_microstep: 1472.61 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1471.55 | step_microstep: 1.71
[2025-11-06 18:51:17,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.17 | bwd: 1473.49 | bwd_inner: 1.77 | bwd_allreduce: 1471.59 | step: 1.78
77%|███████▋ | 2697/3507 [1:06:31<23:14, 1.72s/it] {'loss': 0.2906, 'learning_rate': 2.6709773955256748e-06, 'epoch': 0.77}
tensor([[-2.6094, 1.0078, 2.3750, -1.5859, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([[-5.3750, -4.9375, -1.0859, 2.0312, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([1], device='cuda:2')
tensor([3], device='cuda:3')
tensor([[-4.6875, 0.2188, 3.9531, -2.4219, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-4.0625, -3.6250, 0.1641, 3.4531, -1.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-5.1875, -4.4062, -0.0381, 2.8281, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-3.4844, -4.2188, -1.1172, 3.9844, -0.5664]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-5.8438, -3.4844, 1.5078, 1.4062, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:18,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.68 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.3438, -4.3125, -0.2637, 1.9766, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:19,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.20 | optimizer_step: 0.29
[2025-11-06 18:51:19,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.74 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 0.85 | step_microstep: 98.67
[2025-11-06 18:51:19,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.43 | bwd: 3.09 | bwd_inner: 2.08 | bwd_allreduce: 0.89 | step: 98.75
77%|███████▋ | 2698/3507 [1:06:32<22:47, 1.69s/it] {'loss': 0.3345, 'learning_rate': 2.6646961006451866e-06, 'epoch': 0.77}
tensor([[-3.6250, -0.2422, 2.0000, -1.9531, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-3.9375, -0.0073, 2.9531, -1.4844, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([[-2.7188, -2.7188, 0.2480, 3.9688, -0.3848]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
tensor([3], device='cuda:1')
[2025-11-06 18:51:19,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.06 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-2.8750, 1.4766, 3.5625, -1.9688, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-5.6250, -4.9688, -0.7930, 2.5469, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-5.0312, -5.3125, -1.7969, 2.4531, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
tensor([[-4.8750, -5.4375, -1.9609, 3.0625, -1.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-4.9375, -4.9688, -0.8086, 3.4531, -1.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
[2025-11-06 18:51:20,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:51:20,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.44 | bwd_microstep: 1177.55 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 1176.58 | step_microstep: 1.78
[2025-11-06 18:51:20,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.52 | bwd: 1178.52 | bwd_inner: 1.75 | bwd_allreduce: 1176.62 | step: 1.86
77%|███████▋ | 2699/3507 [1:06:34<22:20, 1.66s/it] {'loss': 0.0561, 'learning_rate': 2.6584210647590813e-06, 'epoch': 0.77}
tensor([[-4.6250, -0.4121, 2.8906, -2.2812, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-4.3438, -1.1250, 2.1562, -0.7227, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-6.9375, -3.3906, 2.2969, -0.1650, -5.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-3.2969, -0.0396, 2.4062, -0.3789, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-3.7031, -2.0000, 2.1719, 2.7969, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-7.0000, -6.3438, -1.4844, 1.7969, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-3.6406, -1.5000, 2.9062, 2.7344, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:21,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.32 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.19
tensor([[-7.0000, -5.2500, 0.5703, 1.9297, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:21,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.15 | optimizer_step: 0.23
[2025-11-06 18:51:21,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.46 | bwd_microstep: 1.84 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.92 | step_microstep: 2.17
[2025-11-06 18:51:21,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 260.78 | bwd: 2.69 | bwd_inner: 1.59 | bwd_allreduce: 0.96 | step: 2.37
77%|███████▋ | 2700/3507 [1:06:35<18:50, 1.40s/it] {'loss': 0.5976, 'learning_rate': 2.6521522932216603e-06, 'epoch': 0.77}
tensor([[-6.2188, -6.0938, -1.7891, 2.0156, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-3.1406, -0.7031, 2.5469, 1.1094, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-5.3125, -4.1875, 1.0781, 3.7969, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:21,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.63 | bwd_microstep: 5.42 | bwd_inner_microstep: 5.27 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-4.9375, -4.2500, -0.5898, 2.1406, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-2.5938, 1.1484, 2.1250, -2.3438, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
tensor([[-4.7812, -4.0938, -0.0820, 2.6406, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-4.3125, -4.6875, -1.7188, 2.4844, -1.5234]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-6.1250, -3.9688, 0.5625, 0.6289, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
[2025-11-06 18:51:23,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:51:23,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.99 | bwd_microstep: 2070.07 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 2069.00 | step_microstep: 1.97
[2025-11-06 18:51:23,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.62 | bwd: 2075.48 | bwd_inner: 6.25 | bwd_allreduce: 2069.06 | step: 2.08
77%|███████▋ | 2701/3507 [1:06:37<22:58, 1.71s/it] {'loss': 0.1916, 'learning_rate': 2.645889791381877e-06, 'epoch': 0.77}
tensor([[-3.3125, 0.9180, 3.1250, -2.2812, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([1], device='cuda:1')
tensor([[-5.5000, -5.0625, -1.0156, 2.4375, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([[-7.1250, -5.3125, -1.3203, -0.6250, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([3], device='cuda:2')
tensor([[-5.8125, -3.2500, 1.9141, 1.5391, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-5.7500, -2.7969, 1.1016, -1.1094, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-4.0312, -4.0625, -1.1797, 2.5312, -1.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-5.5938, -2.9688, 2.0156, 1.0391, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:24,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.73 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.7500, -3.8281, 0.3594, 2.8594, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:24,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 18:51:24,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.23 | bwd_microstep: 1.78 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.36
[2025-11-06 18:51:24,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.97 | bwd: 2.51 | bwd_inner: 1.50 | bwd_allreduce: 0.86 | step: 2.44
77%|███████▋ | 2702/3507 [1:06:38<19:24, 1.45s/it] {'loss': 0.654, 'learning_rate': 2.639633564583337e-06, 'epoch': 0.77}
tensor([[-5.9062, -2.6250, 2.7344, 1.0312, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-5.0000, -1.9297, 2.0469, 0.1963, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:24,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.58 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13
tensor([[-7.1875, -3.3281, 2.6250, -0.6797, -6.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-6.3750, -6.5312, -2.4062, 2.1406, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-3.5000, -4.5312, -3.3125, 1.1016, -0.8828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
tensor([[-4.8750, -3.1875, 0.7188, 1.4375, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-6.1250, -5.9375, -2.4219, 1.1094, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-5.8125, -3.2500, 0.7305, -0.4023, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
[2025-11-06 18:51:25,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:51:25,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.83 | bwd_microstep: 242.96 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 241.56 | step_microstep: 1.56
[2025-11-06 18:51:25,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.43 | bwd: 244.03 | bwd_inner: 2.20 | bwd_allreduce: 241.63 | step: 1.70
77%|███████▋ | 2703/3507 [1:06:39<16:09, 1.21s/it] {'loss': 0.1623, 'learning_rate': 2.633383618164289e-06, 'epoch': 0.77}
tensor([[-2.2812, 0.4180, 3.7812, 2.4219, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-5.4062, -1.4922, 2.8594, -0.9219, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-4.6562, -4.3125, 0.0630, 3.8438, -1.8984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-7.5000, -4.0938, 1.8281, -0.3086, -6.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-4.6562, -4.2812, -0.4473, 3.1562, -1.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-4.8750, -4.5000, -0.4277, 3.0156, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-5.0000, -5.2812, -2.3594, 1.6562, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:27,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.46 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-2.4844, 1.4453, 3.0000, -1.9297, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:27,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.26 | optimizer_step: 0.29
[2025-11-06 18:51:27,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.96 | bwd_microstep: 1.92 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.41
[2025-11-06 18:51:27,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.43 | bwd: 2.83 | bwd_inner: 1.74 | bwd_allreduce: 0.92 | step: 2.52
77%|███████▋ | 2704/3507 [1:06:41<18:48, 1.41s/it] {'loss': 0.5446, 'learning_rate': 2.627139957457623e-06, 'epoch': 0.77}
tensor([[-3.1562, -2.3750, 1.8203, 4.4688, -1.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-4.5938, -4.9375, -1.9688, 2.0312, -1.8359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-4.2500, 0.1177, 3.5625, -2.0156, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:27,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.06 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.3125, -0.6914, 3.1562, -0.7891, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-3.4375, -3.7656, -1.4766, 2.5312, -0.8945]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
tensor([[-5.1562, -1.9531, 2.7656, 0.6094, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-5.5938, -2.9062, 1.8984, 1.2266, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-4.9688, -2.8906, 1.8281, 2.2031, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
[2025-11-06 18:51:29,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.25 | optimizer_step: 0.26
[2025-11-06 18:51:29,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.97 | bwd_microstep: 1354.89 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1353.75 | step_microstep: 2.44
[2025-11-06 18:51:29,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.04 | bwd: 1355.76 | bwd_inner: 1.78 | bwd_allreduce: 1353.82 | step: 2.52
77%|███████▋ | 2705/3507 [1:06:42<20:15, 1.52s/it] {'loss': 0.1643, 'learning_rate': 2.6209025877908746e-06, 'epoch': 0.77}
tensor([[-3.5781, -4.1250, -1.5078, 3.1094, -0.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([4], device='cuda:3')
tensor([[-6.9375, -3.3125, 2.3438, -0.0693, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-5.0625, -4.5625, -0.3730, 2.9844, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-4.3438, -3.0000, 0.5352, 1.6172, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-3.4531, -0.3008, 1.5391, -1.8906, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-7.4062, -4.4062, 1.9766, 0.8789, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-4.7500, -5.1250, -1.6719, 2.8281, -1.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:30,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.17 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-5.6250, -4.5000, -0.0630, 2.2656, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:30,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.17 | optimizer_step: 0.22
[2025-11-06 18:51:30,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.64 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.92 | step_microstep: 2.48
[2025-11-06 18:51:30,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.79 | bwd: 2.97 | bwd_inner: 1.87 | bwd_allreduce: 0.96 | step: 2.57
77%|███████▋ | 2706/3507 [1:06:44<19:18, 1.45s/it] {'loss': 0.6236, 'learning_rate': 2.614671514486197e-06, 'epoch': 0.77}
tensor([[-3.6094, -4.1562, -1.5781, 2.7812, -0.9453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-5.5000, -4.4688, 0.1787, 2.6875, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-5.2500, -4.0938, 0.2012, 2.0312, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:30,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.92 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.7344, -3.3438, -1.4531, 2.8594, -0.2988]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-5.0000, -3.5000, 1.2891, 3.1406, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
tensor([[-6.8750, -3.9844, 1.9219, 0.9062, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-3.7344, -0.3984, 1.9141, -1.6641, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-4.1250, -4.1875, -1.0859, 2.7500, -1.4766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
[2025-11-06 18:51:31,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.22
[2025-11-06 18:51:31,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.32 | bwd_microstep: 1240.93 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 1239.66 | step_microstep: 2.02
[2025-11-06 18:51:31,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.27 | bwd: 1241.81 | bwd_inner: 1.97 | bwd_allreduce: 1239.71 | step: 2.11
77%|███████▋ | 2707/3507 [1:06:45<20:02, 1.50s/it] {'loss': 0.2473, 'learning_rate': 2.6084467428603786e-06, 'epoch': 0.77}
tensor([[-4.0000, -1.7344, 1.1484, 0.1099, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-4.8750, -3.3906, 1.3984, 2.9062, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-6.1562, -1.9688, 2.2812, -2.1406, -5.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-4.8125, -3.3906, 0.6445, 1.7422, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-6.1562, -3.8125, 1.0000, 0.8711, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-6.3750, -3.0625, -0.1699, -2.7812, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-6.5000, -2.6562, 2.6562, -0.6914, -5.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:33,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.38 | bwd_microstep: 1.35 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-4.8125, -1.3984, 1.0781, -2.3750, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:33,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:51:33,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.07 | bwd_microstep: 2.14 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.22
[2025-11-06 18:51:33,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.47 | bwd: 3.49 | bwd_inner: 2.48 | bwd_allreduce: 0.85 | step: 2.33
77%|███████▋ | 2708/3507 [1:06:47<21:09, 1.59s/it] {'loss': 0.2256, 'learning_rate': 2.6022282782248277e-06, 'epoch': 0.77}
tensor([[-4.3125, -1.3047, 1.1953, -1.3750, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-4.3125, -1.8516, 1.8438, 0.8750, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-4.7188, -3.2656, 0.8789, 2.0938, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-3.9062, 0.3438, 3.8750, -1.5000, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:33,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.60 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-1.5625, 2.3750, 3.8594, -1.0000, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-2.0938, 1.2578, 3.1094, -0.6445, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[ 0.1123, 3.1094, 2.0938, -1.9531, -1.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([1], device='cuda:1')
tensor([[-6.4688, -5.7500, -0.7500, 2.7969, -3.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:34,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.21 | optimizer_step: 0.27
[2025-11-06 18:51:34,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 329.17 | bwd_microstep: 2.32 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 1.15 | step_microstep: 2.13
[2025-11-06 18:51:34,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 549.80 | bwd: 3.17 | bwd_inner: 1.80 | bwd_allreduce: 1.21 | step: 2.23
77%|███████▋ | 2709/3507 [1:06:48<17:12, 1.29s/it] {'loss': 0.319, 'learning_rate': 2.5960161258855807e-06, 'epoch': 0.77}
tensor([[-5.7188, -3.4844, 0.8086, 0.3887, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-2.7969, -2.9219, 0.6328, 4.8438, -0.2773]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-6.0312, -2.2188, 2.4688, -0.8672, -5.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:34,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 221.25 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-3.7188, 0.8477, 3.6094, -2.3281, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-1.5234, 2.0156, 2.7188, -1.9062, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
tensor([[-4.8125, -2.4844, 1.2812, 0.7461, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-7.5625, -6.0938, 0.2559, 2.7656, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-8.7500, -6.7500, -0.4238, 0.9141, -5.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
[2025-11-06 18:51:35,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 18:51:35,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.76 | bwd_microstep: 1141.91 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 1140.66 | step_microstep: 2.12
[2025-11-06 18:51:35,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.02 | bwd: 1142.99 | bwd_inner: 2.07 | bwd_allreduce: 1140.71 | step: 2.22
77%|███████▋ | 2710/3507 [1:06:49<18:16, 1.38s/it] {'loss': 0.4527, 'learning_rate': 2.5898102911432755e-06, 'epoch': 0.77}
tensor([[-4.4062, -2.6250, 2.1719, 3.2500, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-6.8750, -4.0625, -0.0302, -1.3047, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-5.4688, -5.8438, -2.5312, 1.7344, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:36,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.82 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.8438, -0.9453, 2.9531, -0.9023, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-6.6250, -3.9531, 2.1406, 1.9688, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-6.1250, -2.6719, 2.7812, 0.4453, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-5.7812, -5.3750, -1.2578, 2.1719, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
tensor([[-3.5156, 0.0374, 2.5312, -1.1328, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
[2025-11-06 18:51:36,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:51:36,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.50 | bwd_microstep: 7.10 | bwd_inner_microstep: 1.29 | bwd_allreduce_microstep: 5.73 | step_microstep: 1.89
[2025-11-06 18:51:36,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.34 | bwd: 7.99 | bwd_inner: 2.06 | bwd_allreduce: 5.78 | step: 1.98
77%|███████▋ | 2711/3507 [1:06:50<14:32, 1.10s/it] {'loss': 0.3455, 'learning_rate': 2.5836107792931653e-06, 'epoch': 0.77}
tensor([[-0.8359, 2.4531, 2.5469, -1.8281, -1.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-6.7500, -5.5625, -0.6055, 1.6172, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-6.7812, -3.0781, 2.8906, -0.1089, -5.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-4.0000, -4.1875, -0.7188, 3.3906, -1.3047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-2.9375, -2.9062, 0.4902, 4.3125, -0.5547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-3.5156, -3.1562, 0.1406, 3.0312, -1.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-4.8125, -2.5469, 2.0938, 2.1250, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:38,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.30 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.6875, -5.2188, 0.6758, 2.6875, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:38,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:51:38,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 294.26 | bwd_microstep: 2.33 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1.06 | step_microstep: 2.48
[2025-11-06 18:51:38,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 491.60 | bwd: 3.27 | bwd_inner: 2.01 | bwd_allreduce: 1.08 | step: 2.55
77%|███████▋ | 2712/3507 [1:06:52<18:39, 1.41s/it] {'loss': 0.2213, 'learning_rate': 2.577417595625107e-06, 'epoch': 0.77}
tensor([[-4.6562, -2.4688, -1.3438, -2.7969, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([1], device='cuda:3')
tensor([[-5.1250, -4.2188, -0.0605, 2.3594, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-5.6875, -1.6172, 2.7031, -1.7109, -5.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-6.2812, -2.6719, 2.6250, -0.0292, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-5.4375, -1.3984, 3.0625, -0.9023, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-4.6875, -4.5312, -0.5234, 3.3125, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-5.4375, -3.3438, 0.5664, 0.8398, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:39,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.19 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-6.6875, -3.4219, 2.1719, 0.1182, -5.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:39,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:51:39,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.40 | bwd_microstep: 2.01 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.06
[2025-11-06 18:51:39,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 507.62 | bwd: 2.88 | bwd_inner: 1.89 | bwd_allreduce: 0.85 | step: 2.13
77%|███████▋ | 2713/3507 [1:06:53<17:14, 1.30s/it] {'loss': 0.3587, 'learning_rate': 2.5712307454235585e-06, 'epoch': 0.77}
tensor([[-2.5156, 2.0469, 4.2500, -1.7812, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-3.1562, -3.1562, 0.6836, 4.5938, -0.7109]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-2.5156, 1.3906, 2.9219, -1.6797, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([1], device='cuda:1')
tensor([[-2.0312, -2.8906, -1.2188, 3.1875, 0.3398]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-5.9375, -5.8125, -2.3281, 1.3516, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:40,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.12 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.0625, -0.8594, 3.6875, -0.8008, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
tensor([[-7.2812, -6.5625, -1.7344, 1.3359, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-5.5312, -3.5781, 0.9492, 1.5703, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
[2025-11-06 18:51:41,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.77 | optimizer_gradients: 0.19 | optimizer_step: 0.21
[2025-11-06 18:51:41,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.30 | bwd_microstep: 564.79 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 563.77 | step_microstep: 2.74
[2025-11-06 18:51:41,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 345.43 | bwd: 565.66 | bwd_inner: 1.69 | bwd_allreduce: 563.82 | step: 2.82
77%|███████▋ | 2714/3507 [1:06:55<19:12, 1.45s/it] {'loss': 0.3905, 'learning_rate': 2.565050233967573e-06, 'epoch': 0.77}
tensor([[-3.0156, 0.5039, 3.2969, -0.8242, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-3.0469, -3.9688, -2.7656, 1.4062, -0.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([4], device='cuda:0')
[2025-11-06 18:51:41,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.89 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.5938, -4.6875, -1.8281, 1.6094, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-3.6406, -0.6289, 3.0000, 0.9219, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-4.1562, -2.6406, 0.9570, 1.5859, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
tensor([[-6.4688, -5.4062, -0.9922, 1.1328, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-4.8438, -4.7812, -1.4375, 2.2969, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-5.4688, -4.8750, -1.2344, 1.6797, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
[2025-11-06 18:51:42,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.30 | optimizer_step: 0.37
[2025-11-06 18:51:42,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.64 | bwd_microstep: 1165.76 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 1164.90 | step_microstep: 3.02
[2025-11-06 18:51:42,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.57 | bwd: 1166.46 | bwd_inner: 1.35 | bwd_allreduce: 1164.96 | step: 3.10
77%|███████▋ | 2715/3507 [1:06:56<19:39, 1.49s/it] {'loss': 0.3828, 'learning_rate': 2.5588760665307953e-06, 'epoch': 0.77}
tensor([[-3.4531, 0.9688, 3.2031, -2.6875, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-3.8125, -2.8281, 1.5625, 3.7812, -1.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-4.8438, -3.6562, 0.0116, 1.5234, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-4.5000, -4.1562, -0.9453, 1.8828, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:2')
tensor([[-2.7656, 0.2598, 3.1250, 0.5117, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:44,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.69 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.6562, -4.3750, -0.2793, 3.4062, -1.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-1.8516, 1.4531, 1.6562, -2.2500, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-2.3750, -2.7969, -1.4062, 2.0156, -0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([4], device='cuda:0')
[2025-11-06 18:51:45,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:51:45,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.84 | bwd_microstep: 1.67 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.36
[2025-11-06 18:51:45,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 384.55 | bwd: 2.66 | bwd_inner: 1.65 | bwd_allreduce: 0.86 | step: 2.46
77%|███████▋ | 2716/3507 [1:06:59<22:49, 1.73s/it] {'loss': 1.0979, 'learning_rate': 2.5527082483814537e-06, 'epoch': 0.77}
tensor([[-2.7812, -3.4219, -0.5273, 4.3438, -0.0972]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-5.5938, -3.8438, 1.0234, 2.1406, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-4.7188, -3.1094, 0.7227, 1.7344, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
tensor([[-5.1562, -4.4062, -0.0449, 2.9219, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:45,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.08 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-5.9062, -2.2188, 2.9219, 0.0898, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-5.7812, -4.9688, -0.9062, 1.7578, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-6.3125, -6.8125, -3.2969, 1.6719, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
tensor([[-7.1875, -3.8906, 2.3594, 0.8555, -5.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:3')
[2025-11-06 18:51:45,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.17 | optimizer_step: 0.23
[2025-11-06 18:51:45,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.82 | bwd_microstep: 84.24 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 83.26 | step_microstep: 1.99
[2025-11-06 18:51:45,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.92 | bwd: 85.16 | bwd_inner: 1.68 | bwd_allreduce: 83.31 | step: 2.10
77%|███████▋ | 2717/3507 [1:06:59<18:13, 1.38s/it] {'loss': 0.6272, 'learning_rate': 2.546546784782371e-06, 'epoch': 0.77}
tensor([[-4.7188, -4.0625, -0.2119, 2.6719, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-4.6875, -2.5312, 1.2734, 1.1094, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-6.6250, -6.5625, -2.5781, 1.9922, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:0')
[2025-11-06 18:51:46,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.48 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-1.7891, -0.0996, 3.4375, 3.8906, -0.7461]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
tensor([[-6.6562, -4.6562, 0.3281, 0.9258, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:1')
tensor([[-4.6562, -3.1719, 0.3711, 1.2578, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([1], device='cuda:3')
tensor([[-5.3750, -3.7031, 0.9258, 1.8281, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:3')
tensor([[-4.2188, -2.3125, 1.0625, 0.7969, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([2], device='cuda:0')
[2025-11-06 18:51:49,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:51:49,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.84 | bwd_microstep: 1.78 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.73 | step_microstep: 2.34
[2025-11-06 18:51:49,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.34 | bwd: 2.76 | bwd_inner: 1.87 | bwd_allreduce: 0.77 | step: 2.43
78%|███████▊ | 2718/3507 [1:07:03<26:36, 2.02s/it] {'loss': 1.0196, 'learning_rate': 2.54039168099093e-06, 'epoch': 0.78}
78%|███████▊ | 2718/3507
[1:07:03<26:36, 2.02s/it]tensor([[-5.0312, -2.8125, 0.9258, 0.4473, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.2500, -5.2500, 0.7383, 1.8125, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5000, -3.9531, 0.0879, 1.4453, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9062, -4.4688, -0.0143, 1.5000, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:49,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.28 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.7500, -3.1094, 0.4688, -1.1719, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3750, -2.7344, 0.2559, 2.7812, -1.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9219, -1.2812, 1.8438, 0.1582, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5781, -2.7812, -0.3203, 1.4062, -1.8672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:49,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:51:49,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.57 | bwd_microstep: 6.90 | bwd_inner_microstep: 6.09 | bwd_allreduce_microstep: 0.72 | step_microstep: 1.73 [2025-11-06 18:51:49,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.88 | bwd: 7.66 | bwd_inner: 6.77 | bwd_allreduce: 0.75 | step: 1.82 78%|███████▊ | 2719/3507 [1:07:03<20:11, 1.54s/it] {'loss': 0.4573, 'learning_rate': 2.5342429422590984e-06, 'epoch': 0.78} 78%|███████▊ | 2719/3507 [1:07:03<20:11, 
1.54s/it]tensor([[-5.6250, -2.4375, 2.0938, -0.2656, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4375, -5.0312, -1.0469, 2.3281, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:49,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.21 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-5.5625, -2.5938, 1.5078, -0.3516, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1562, -4.4375, 0.1494, 3.1562, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.4688, -5.1562, 1.1250, 1.8438, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5938, -2.9531, 0.8867, 1.6172, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.4688, -3.0469, 0.1396, 3.1094, -1.2734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2188, -3.6094, 0.0239, 2.8125, -1.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:51:51,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:51:51,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.32 | bwd_microstep: 809.45 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 808.24 | step_microstep: 1.94 [2025-11-06 18:51:51,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.56 | bwd: 810.41 | bwd_inner: 1.97 | bwd_allreduce: 808.29 | step: 2.04 78%|███████▊ | 2720/3507 [1:07:05<21:19, 1.63s/it] {'loss': 0.2521, 'learning_rate': 2.5281005738334087e-06, 'epoch': 0.78} 78%|███████▊ | 2720/3507 [1:07:05<21:19, 1.63s/it]tensor([[-4.8125, 
0.1104, 3.7188, -2.8750, -5.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:51:51,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.41 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.3438, -4.3125, 0.1709, 0.6211, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8438, -4.0938, -1.2109, 2.9062, -1.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.7969, -1.4219, 2.1250, 3.4219, -1.2734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -3.1875, 1.0234, 1.8438, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1875, -2.3281, 2.7031, -0.6562, -5.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-8.6875, -6.9375, -0.4668, 1.4062, -5.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-8.4375, -6.9375, -0.6172, 1.8750, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:51:52,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:51:52,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.29 | bwd_microstep: 144.64 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 143.60 | step_microstep: 1.52 [2025-11-06 18:51:52,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.72 | bwd: 145.61 | bwd_inner: 1.83 | bwd_allreduce: 143.64 | step: 1.61 78%|███████▊ | 2721/3507 [1:07:05<17:01, 1.30s/it] {'loss': 0.6572, 'learning_rate': 2.521964580954964e-06, 'epoch': 0.78} 78%|███████▊ | 2721/3507 [1:07:05<17:01, 1.30s/it]tensor([[-7.8125, -5.7188, 0.4688, 1.4688, 
-5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:52,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.97 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.7500, -3.0156, 1.1094, 1.8516, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7500, -3.3750, 0.5469, 3.9688, -1.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5625, -6.0938, -0.8555, 3.2188, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9531, -1.7812, 2.2812, 2.1562, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0625, -3.6875, 1.3672, 1.2109, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0312, -0.3848, 2.0312, -0.1924, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3750, -5.0000, -2.3750, 2.0469, -1.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:53,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 18:51:53,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.59 | bwd_microstep: 2.10 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.94 | step_microstep: 2.14 [2025-11-06 18:51:53,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 266.58 | bwd: 2.99 | bwd_inner: 1.89 | bwd_allreduce: 0.97 | step: 2.21 78%|███████▊ | 2722/3507 [1:07:07<17:12, 1.31s/it] {'loss': 0.3006, 'learning_rate': 2.515834968859423e-06, 'epoch': 0.78} 78%|███████▊ | 2722/3507 [1:07:07<17:12, 1.31s/it]tensor([[-6.1875, -4.4688, 0.4531, 1.4531, -4.0625]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:51:53,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.99 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.6562, -5.9688, -2.5000, 1.7734, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.1250, -5.2188, -0.6602, 0.6875, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.2500, -5.4062, 0.0342, -0.5312, -6.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.6562, -3.0469, 2.5938, 1.9766, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-10.1250, -7.2500, -1.4609, -1.9375, -7.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.0312, -4.2500, 1.8203, 1.1250, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3438, -4.6250, -0.6914, 1.8516, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:51:54,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:51:54,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.32 | bwd_microstep: 388.37 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 387.46 | step_microstep: 2.26 [2025-11-06 18:51:54,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 286.33 | bwd: 389.38 | bwd_inner: 1.70 | bwd_allreduce: 387.52 | step: 2.36 78%|███████▊ | 2723/3507 [1:07:07<14:49, 1.13s/it] {'loss': 0.5791, 'learning_rate': 2.5097117427769925e-06, 'epoch': 0.78} 78%|███████▊ | 2723/3507 [1:07:07<14:49, 1.13s/it]tensor([[-2.2344, 1.0938, 2.8125, -0.9531, -2.6719]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7812, -2.6250, 2.2656, 0.2637, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.7500, -4.2188, 1.6875, 1.5781, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.9062, -6.9688, -2.4531, 0.4746, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:54,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.50 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.4844, -3.7969, -0.5430, 3.6719, -0.8477]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.4062, -3.3125, 1.6172, -0.2500, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5000, -2.4844, 1.0859, 2.9375, -1.6797]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2188, -1.2812, 3.0938, -0.6406, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:51:55,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.08 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:51:55,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.04 | bwd_microstep: 709.76 | bwd_inner_microstep: 1.30 | bwd_allreduce_microstep: 708.34 | step_microstep: 2.90 [2025-11-06 18:51:55,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.46 | bwd: 710.64 | bwd_inner: 2.07 | bwd_allreduce: 708.40 | step: 2.98 78%|███████▊ | 2724/3507 [1:07:09<17:18, 1.33s/it] {'loss': 0.1887, 'learning_rate': 2.5035949079324396e-06, 'epoch': 0.78} 78%|███████▊ | 2724/3507 [1:07:09<17:18, 1.33s/it]tensor([[-1.9844, 0.9844, 2.0938, -1.3125, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:0') [2025-11-06 18:51:56,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.11 | bwd_microstep: 1.25 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.16 tensor([[-6.7500, -4.0000, 1.6016, 0.8594, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4844, 0.2295, 2.2500, -2.2812, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7500, -1.6094, 2.3125, -0.1846, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.1436, 2.8125, 5.0625, 1.9688, -0.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1562, -0.0972, 3.8750, -0.3789, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4062, -4.3438, -0.3906, 3.6250, -1.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8750, -4.6562, 0.1050, 2.1562, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:51:57,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.20 | optimizer_step: 0.29 [2025-11-06 18:51:57,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 206.15 | bwd_microstep: 1396.62 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1395.38 | step_microstep: 2.16 [2025-11-06 18:51:57,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.28 | bwd: 1397.84 | bwd_inner: 2.11 | bwd_allreduce: 1395.48 | step: 2.32 78%|███████▊ | 2725/3507 [1:07:11<19:15, 1.48s/it] {'loss': 0.4171, 'learning_rate': 2.4974844695450794e-06, 'epoch': 0.78} 78%|███████▊ | 2725/3507 [1:07:11<19:15, 1.48s/it]tensor([[-4.7188, -1.7891, 2.2031, 0.1045, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-5.0312, -0.3770, 3.8594, 
-2.1719, -5.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([2], device='cuda:2') [2025-11-06 18:51:57,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.16 | bwd_microstep: 1.25 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.3438, -3.8750, 0.6602, 2.2188, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -3.0156, 0.3906, 1.6875, -2.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.9688e+00, -3.7812e+00, 2.2278e-03, 1.2266e+00, -3.0938e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5625, -4.3750, -0.4043, 3.4062, -1.8359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5312, -2.9531, 2.1875, -0.7227, -5.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0938, 0.0417, 3.2656, -1.5000, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:51:59,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:51:59,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.15 | bwd_microstep: 1216.93 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1215.81 | step_microstep: 2.37 [2025-11-06 18:51:59,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.33 | bwd: 1218.17 | bwd_inner: 2.17 | bwd_allreduce: 1215.85 | step: 2.46 78%|███████▊ | 2726/3507 [1:07:13<19:32, 1.50s/it] {'loss': 0.4763, 'learning_rate': 2.4913804328287626e-06, 'epoch': 0.78} 78%|███████▊ | 2726/3507 [1:07:13<19:32, 1.50s/it]tensor([[-4.6562, -3.2656, 1.0078, 2.4219, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-7.4062, -5.8750, 0.1553, 2.1875, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:51:59,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.37 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-4.2500, 0.5625, 4.1875, -2.3125, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7812, -3.5156, -1.2266, 1.5000, -1.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4688, -1.3906, 2.2656, 0.4375, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.2969, 1.1797, 3.0938, -0.5430, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3125, -2.0469, 2.9688, 0.7344, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.1562, -3.2031, 1.0859, -2.7812, -6.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:52:01,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.24 | optimizer_step: 0.26 [2025-11-06 18:52:01,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.09 | bwd_microstep: 1361.92 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 1360.91 | step_microstep: 2.63 [2025-11-06 18:52:01,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.48 | bwd: 1362.83 | bwd_inner: 1.70 | bwd_allreduce: 1360.97 | step: 2.74 78%|███████▊ | 2727/3507 [1:07:14<20:33, 1.58s/it] {'loss': 0.6915, 'learning_rate': 2.4852828029918818e-06, 'epoch': 0.78} 78%|███████▊ | 2727/3507 [1:07:14<20:33, 1.58s/it]tensor([[-2.8594, -3.7500, -2.3594, 2.0000, -0.3262]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7188, -4.3125, 
0.1650, 2.0000, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.1250, -6.1562, -1.4219, 1.4062, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1562, -5.2812, -1.7422, 2.4844, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.3203, 3.0000, 3.9375, -2.3438, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5000, -5.0000, -0.8906, 2.4531, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:52:01,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.16 | bwd_microstep: 1.23 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-7.3438, -6.0938, -1.0234, 1.1641, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1562, -3.5469, -0.0493, 2.6094, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:52:01,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:52:01,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.64 | bwd_microstep: 1.94 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.83 | step_microstep: 1.95 [2025-11-06 18:52:01,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.81 | bwd: 3.17 | bwd_inner: 2.15 | bwd_allreduce: 0.87 | step: 2.04 78%|███████▊ | 2728/3507 [1:07:15<17:29, 1.35s/it] {'loss': 0.1152, 'learning_rate': 2.4791915852373604e-06, 'epoch': 0.78} 78%|███████▊ | 2728/3507 [1:07:15<17:29, 1.35s/it]tensor([[-5.0312, -4.9375, -1.2734, 2.5312, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:52:02,089] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 197.73 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.0625, -4.3750, 1.0547, 0.1328, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.2188, -4.0312, 0.7070, -1.7031, -6.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5938, -4.4062, 0.4043, 0.5273, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.5625, -4.5938, -0.1338, 2.2812, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8672, 2.2969, 2.9062, -2.5938, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.0000, -0.0184, 4.1250, 0.0369, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0000, 0.2988, 3.4375, -1.4609, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:52:03,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:52:03,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.11 | bwd_microstep: 1313.85 | bwd_inner_microstep: 2.19 | bwd_allreduce_microstep: 1311.55 | step_microstep: 1.51 [2025-11-06 18:52:03,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 388.87 | bwd: 1314.57 | bwd_inner: 2.81 | bwd_allreduce: 1311.59 | step: 1.60 78%|███████▊ | 2729/3507 [1:07:17<19:01, 1.47s/it] {'loss': 0.3072, 'learning_rate': 2.4731067847626512e-06, 'epoch': 0.78} 78%|███████▊ | 2729/3507 [1:07:17<19:01, 1.47s/it]tensor([[-2.3438, 1.5703, 3.8281, -0.7148, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-6.5000, -4.6562, -0.3379, 0.1797, -4.5312]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.3750, -5.9688, -1.4844, 2.1562, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:52:03,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.22 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.8750, -3.4844, 1.7344, 1.6875, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.8750, -4.1562, 1.7578, 1.3281, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0938, -0.3164, 3.1875, -3.0625, -5.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5000, -5.1875, -0.9180, 0.6875, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.7188, -4.8125, 0.1592, 0.7227, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:52:05,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:52:05,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.82 | bwd_microstep: 1245.45 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1244.42 | step_microstep: 2.45 [2025-11-06 18:52:05,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.06 | bwd: 1246.50 | bwd_inner: 1.90 | bwd_allreduce: 1244.47 | step: 2.54 78%|███████▊ | 2730/3507 [1:07:19<20:28, 1.58s/it] {'loss': 0.6578, 'learning_rate': 2.4670284067597316e-06, 'epoch': 0.78} 78%|███████▊ | 2730/3507 [1:07:19<20:28, 1.58s/it]tensor([[-5.2812, -0.8867, 3.1250, -1.9766, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7812, -5.2500, -2.5625, 1.6250, -1.8828]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -0.3242, 3.9062, -1.6953, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:52:05,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.87 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-4.1250, -2.5625, 1.8906, 3.4531, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.5781, 0.1729, 2.5312, -0.1709, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4688, -2.5000, 2.5938, 1.2031, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7188, -2.1406, 2.1719, 0.9922, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2812, -3.1094, 0.1167, 1.1328, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:52:06,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:52:06,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 121.88 | bwd_microstep: 378.50 | bwd_inner_microstep: 1.44 | bwd_allreduce_microstep: 376.96 | step_microstep: 1.61 [2025-11-06 18:52:06,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 384.78 | bwd: 379.45 | bwd_inner: 2.26 | bwd_allreduce: 377.02 | step: 1.71 78%|███████▊ | 2731/3507 [1:07:20<17:26, 1.35s/it] {'loss': 0.5307, 'learning_rate': 2.4609564564151e-06, 'epoch': 0.78} 78%|███████▊ | 2731/3507 [1:07:20<17:26, 1.35s/it]tensor([[-5.3750, -4.5312, -0.0825, 2.7188, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.8750, -4.7188, 0.4941, 0.8008, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], 
[Training-log capture, steps 2732–2751 of 3507 (epoch 0.78). Each step also printed per-rank debug output — a 1×5 bfloat16 logits tensor plus an integer label tensor on each of cuda:0–cuda:3 (the `grad_fn=<...>` names were lost in extraction) — and DeepSpeed [Rank 0] wall-clock breakdowns (fwd/bwd/bwd_inner/bwd_allreduce/step microsteps, in ms). Those repetitive per-step lines are omitted below; in them, bwd_allreduce dominates the slow steps, spiking as high as ~3.0 s (e.g. 3009.07 ms at step 2741) versus ~35–90 ms on fast steps. The capture cuts off partway through the debug output of step 2752.]

step  elapsed   s/it  loss    learning_rate
2732  1:07:22   1.70  0.1255  2.454890938909764e-06
2733  1:07:23   1.32  1.0802  2.4488318594192582e-06
2734  1:07:24   1.25  0.3624  2.4427792231136047e-06
2735  1:07:27   1.78  0.0966  2.436733035157337e-06
2736  1:07:28   1.72  0.8253  2.4306933007094834e-06
2737  1:07:29   1.57  1.3565  2.424660024923575e-06
2738  1:07:30   1.35  0.8132  2.4186332129476196e-06
2739  1:07:33   1.61  0.2675  2.4126128699241193e-06
2740  1:07:33   1.40  0.5471  2.406599000990043e-06
2741  1:07:37   2.02  0.1999  2.4005916112768524e-06
2742  1:07:37   1.53  0.3318  2.39459070591047e-06
2743  1:07:39   1.65  0.279   2.388596290011288e-06
2744  1:07:40   1.28  0.5495  2.3826083686941614e-06
2745  1:07:42   1.72  0.6233  2.3766269470684045e-06
2746  1:07:43   1.33  0.2474  2.3706520302377823e-06
2747  1:07:45   1.64  0.2805  2.3646836233005133e-06
2748  1:07:46   1.27  0.2733  2.3587217313492572e-06
2749  1:07:49   1.83  0.4747  2.3527663594711225e-06
2750  1:07:49   1.43  0.5605  2.346817512747649e-06
2751  1:07:52   1.86  0.2898  2.3408751962548037e-06
0.09 tensor([[-6.0938, -2.1094, 2.8906, -0.7656, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.0625e+00, -3.9688e+00, 1.3885e-03, -1.9453e+00, -5.8125e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6562, -0.8867, 2.6719, -1.2969, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6406, 0.6914, 2.4844, -0.8398, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:52:39,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.88 | optimizer_gradients: 0.14 | optimizer_step: 0.19 [2025-11-06 18:52:39,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.11 | bwd_microstep: 1.94 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.81 | step_microstep: 4.04 [2025-11-06 18:52:39,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.14 | bwd: 3.06 | bwd_inner: 2.08 | bwd_allreduce: 0.84 | step: 4.13 78%|███████▊ | 2752/3507 [1:07:52<17:58, 1.43s/it] {'loss': 0.5622, 'learning_rate': 2.3349394150629856e-06, 'epoch': 0.78} 78%|███████▊ | 2752/3507 [1:07:52<17:58, 1.43s/it]tensor([[-7.4688, -4.9688, -1.5078, -2.1406, -5.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1875, -5.3125, -0.8867, 1.7109, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:52:39,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.37 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.7188, -0.1211, 1.9609, -0.2373, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2188, -3.1250, 1.2812, 1.3438, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-3.8281, -0.7070, 2.5625, 0.2285, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.2188, -5.3125, -0.8320, -0.1406, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7344, -0.9805, 0.9609, 0.3027, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7812, -3.3594, 0.0265, 3.1719, -1.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:52:41,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.21 | optimizer_step: 0.32 [2025-11-06 18:52:41,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.29 | bwd_microstep: 2387.62 | bwd_inner_microstep: 2.31 | bwd_allreduce_microstep: 2385.14 | step_microstep: 2.51 [2025-11-06 18:52:41,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.70 | bwd: 2388.60 | bwd_inner: 3.25 | bwd_allreduce: 2385.19 | step: 2.58 79%|███████▊ | 2753/3507 [1:07:55<23:00, 1.83s/it] {'loss': 0.4029, 'learning_rate': 2.3290101742370243e-06, 'epoch': 0.79} 79%|███████▊ | 2753/3507 [1:07:55<23:00, 1.83s/it]tensor([[-2.0312, -2.7344, -1.6641, 2.0938, 0.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2812, -3.7344, 1.0000, 2.5938, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.9688, -2.4219, 2.2969, -0.8242, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:52:42,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.73 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.1719, 1.3672, 4.0625, -1.9609, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4375, -4.8438, 
0.4941, 2.2188, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1250, -1.5469, 1.7891, 0.2373, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.3438, 1.2188, 2.9844, -0.9414, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-7.0625, -3.6406, 2.3438, 0.0503, -5.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:52:42,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.01 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:52:42,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.25 | bwd_microstep: 182.74 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 181.54 | step_microstep: 2.61 [2025-11-06 18:52:42,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.97 | bwd: 183.66 | bwd_inner: 1.95 | bwd_allreduce: 181.58 | step: 2.69 79%|███████▊ | 2754/3507 [1:07:56<18:14, 1.45s/it] {'loss': 0.3653, 'learning_rate': 2.3230874788361612e-06, 'epoch': 0.79} 79%|███████▊ | 2754/3507 [1:07:56<18:14, 1.45s/it]tensor([[-4.0000, -1.1875, 1.3359, -0.6875, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:52:42,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.36 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.9844, -3.5781, -0.0242, 3.3125, -1.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.6641, 1.2812, 4.6250, 4.6875, 0.0170]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7500, -1.1875, 1.8984, 0.3789, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.8125, -3.1406, 0.9492, -2.2188, -6.0000]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9375, 0.9648, 3.1719, -1.5781, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.2812, -4.2500, 1.3672, 1.9531, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8438, 1.0781, 3.5312, -1.0703, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:52:44,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.89 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 18:52:44,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.32 | bwd_microstep: 1160.42 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 1159.40 | step_microstep: 8.64 [2025-11-06 18:52:44,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.71 | bwd: 1161.20 | bwd_inner: 1.61 | bwd_allreduce: 1159.44 | step: 8.73 79%|███████▊ | 2755/3507 [1:07:57<18:37, 1.49s/it] {'loss': 0.6298, 'learning_rate': 2.3171713339140554e-06, 'epoch': 0.79} 79%|███████▊ | 2755/3507 [1:07:57<18:37, 1.49s/it]tensor([[-6.0312, -1.9375, 2.8594, -1.3203, -5.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3750, -1.3359, 1.9531, -2.5469, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.1250, -4.6562, 0.0894, 1.6094, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:52:44,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.50 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.2188, -2.9531, -2.3750, 0.8008, -0.2002]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-2.3281, -3.2656, -1.8203, 2.5625, 0.1050]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4219, -2.9062, -1.8359, 1.4609, -0.3320]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9531, -1.5781, 2.0156, 0.8672, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0938, -0.3379, 3.8438, -0.0203, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:52:44,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.49 | optimizer_step: 0.42 [2025-11-06 18:52:44,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.61 | bwd_microstep: 26.67 | bwd_inner_microstep: 1.41 | bwd_allreduce_microstep: 25.05 | step_microstep: 4.07 [2025-11-06 18:52:44,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.12 | bwd: 27.75 | bwd_inner: 2.37 | bwd_allreduce: 25.11 | step: 4.15 79%|███████▊ | 2756/3507 [1:07:58<14:42, 1.18s/it] {'loss': 0.702, 'learning_rate': 2.311261744518769e-06, 'epoch': 0.79} 79%|███████▊ | 2756/3507 [1:07:58<14:42, 1.18s/it]tensor([[-3.1875, -3.7500, -1.7109, 2.3125, -0.6523]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -4.7812, -1.2109, 1.9844, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:52:44,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.56 | bwd_microstep: 1.73 | bwd_inner_microstep: 1.42 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.20 tensor([[-5.0625, -5.4375, -2.2812, 2.0781, -2.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -3.7656, 0.7852, 2.1562, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -0.9062, 2.5312, -0.4688, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:2') tensor([[-4.6562, -2.8906, 1.0000, 1.3203, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1250, -4.0312, 0.2676, 2.3750, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -4.3750, -1.9062, 1.6875, -1.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:52:47,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.34 | optimizer_step: 0.28 [2025-11-06 18:52:47,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.54 | bwd_microstep: 364.09 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 362.94 | step_microstep: 3.09 [2025-11-06 18:52:47,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 409.13 | bwd: 365.82 | bwd_inner: 2.45 | bwd_allreduce: 363.07 | step: 3.30 79%|███████▊ | 2757/3507 [1:08:01<21:38, 1.73s/it] {'loss': 0.1523, 'learning_rate': 2.305358715692784e-06, 'epoch': 0.79} 79%|███████▊ | 2757/3507 [1:08:01<21:38, 1.73s/it]tensor([[-5.0938, -3.0156, 1.4688, 1.7812, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6719, -3.6250, 0.2061, 4.0000, -1.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -2.3906, 2.6875, 2.4219, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4375, -4.8125, -2.3594, 1.3125, -1.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:52:47,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.72 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.2422, 1.6562, 3.3594, 1.0859, -1.3516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') 
tensor([[-3.7656, 0.0245, 3.0469, -1.3828, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3906, 0.6094, 3.1094, -1.2422, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.6562, -4.3438, 1.4297, 1.6250, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:52:48,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.29 | optimizer_step: 0.19 [2025-11-06 18:52:48,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 294.55 | bwd_microstep: 3.73 | bwd_inner_microstep: 2.19 | bwd_allreduce_microstep: 1.40 | step_microstep: 2.24 [2025-11-06 18:52:48,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 520.32 | bwd: 4.62 | bwd_inner: 3.02 | bwd_allreduce: 1.42 | step: 2.34 79%|███████▊ | 2758/3507 [1:08:01<17:16, 1.38s/it] {'loss': 0.7494, 'learning_rate': 2.2994622524729748e-06, 'epoch': 0.79} 79%|███████▊ | 2758/3507 [1:08:01<17:16, 1.38s/it]tensor([[-5.5625, -1.1953, 2.4844, -2.5000, -5.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:52:48,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.98 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.7812, -3.8125, 0.6211, 1.0781, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6562, -1.2578, 3.4844, -1.5625, -5.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5312, -1.7109, 3.0156, -0.6680, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.6797, -2.7031, -2.4062, 1.4766, 0.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0312, -4.5938, -0.5977, 
2.6875, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9062, -4.6562, -0.3789, 1.3203, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.1250, -5.8125, -1.2188, 0.9336, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:52:50,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.87 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:52:50,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.89 | bwd_microstep: 1291.56 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1290.42 | step_microstep: 3.01 [2025-11-06 18:52:50,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.85 | bwd: 1292.62 | bwd_inner: 2.01 | bwd_allreduce: 1290.46 | step: 3.08 79%|███████▊ | 2759/3507 [1:08:04<22:47, 1.83s/it] {'loss': 0.4289, 'learning_rate': 2.2935723598906168e-06, 'epoch': 0.79} 79%|███████▊ | 2759/3507 [1:08:04<22:47, 1.83s/it]tensor([[-6.9062, -3.1875, 2.6250, -0.1729, -5.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9375, -3.3438, 0.4336, 1.2891, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9375, 1.5547, 2.7031, -1.0469, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:52:51,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.91 | bwd_microstep: 1.35 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.17 tensor([[-2.3750, -3.0312, -1.2031, 3.1719, 0.0879]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4375, -2.6562, 1.0234, 1.3203, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.8125, -3.7812, 2.6719, -0.6211, -6.6562]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -0.7227, 2.8594, -1.6484, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3125, -1.7656, 2.0938, -0.9141, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:52:51,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.21 | optimizer_step: 0.19 [2025-11-06 18:52:51,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.51 | bwd_microstep: 2.29 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 0.97 | step_microstep: 2.60 [2025-11-06 18:52:51,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.46 | bwd: 3.63 | bwd_inner: 2.41 | bwd_allreduce: 1.03 | step: 2.77 79%|███████▊ | 2760/3507 [1:08:05<17:30, 1.41s/it] {'loss': 0.1902, 'learning_rate': 2.287689042971376e-06, 'epoch': 0.79} 79%|███████▊ | 2760/3507 [1:08:05<17:30, 1.41s/it]tensor([[-2.9219, 1.5391, 4.1250, -1.7734, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:52:51,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.06 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.4062, -1.5547, 2.4688, -1.3828, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -3.0781, 1.3516, 2.7188, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7500, -3.6094, 0.9961, 1.0234, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1250, -4.9062, -0.8828, 0.8320, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.6875, -5.6875, -0.7148, 2.0781, -3.8750]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2500, -1.1562, 2.7656, 0.7305, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6562, -0.6289, 3.8594, -0.2305, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:52:53,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 3.31 | optimizer_gradients: 0.22 | optimizer_step: 0.33 [2025-11-06 18:52:53,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.22 | bwd_microstep: 2067.32 | bwd_inner_microstep: 1.37 | bwd_allreduce_microstep: 2065.79 | step_microstep: 5.69 [2025-11-06 18:52:53,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.33 | bwd: 2068.27 | bwd_inner: 2.22 | bwd_allreduce: 2065.83 | step: 5.77 79%|███████▊ | 2761/3507 [1:08:07<21:20, 1.72s/it] {'loss': 0.3553, 'learning_rate': 2.2818123067353172e-06, 'epoch': 0.79} 79%|███████▊ | 2761/3507 [1:08:07<21:20, 1.72s/it]tensor([[-5.9062, -2.1406, 3.3438, 0.1206, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2500, -4.1875, 0.1846, 2.3906, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:52:54,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.48 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.7188, -5.2500, -1.0547, 2.1719, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.0000, -4.9375, 1.1406, 2.0625, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.8320, 2.0312, 2.1250, -1.4688, -1.6484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1875, -1.7031, 2.6875, -0.2754, -4.5938]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2188, -3.7812, 0.8984, 2.4062, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3125, -3.1719, 2.3594, 2.5781, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:52:55,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.82 | optimizer_gradients: 0.20 | optimizer_step: 0.26 [2025-11-06 18:52:55,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 73.44 | bwd_microstep: 1039.69 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 1038.89 | step_microstep: 2.88 [2025-11-06 18:52:55,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 233.93 | bwd: 1040.41 | bwd_inner: 1.32 | bwd_allreduce: 1038.94 | step: 2.96 79%|███████▉ | 2762/3507 [1:08:08<19:49, 1.60s/it] {'loss': 0.7529, 'learning_rate': 2.275942156196875e-06, 'epoch': 0.79} 79%|███████▉ | 2762/3507 [1:08:08<19:49, 1.60s/it]tensor([[-5.6875, -5.3750, -0.9258, 3.1875, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0625, -2.5469, 2.3750, 1.7422, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.7812, 1.8203, 2.8125, -1.9531, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:52:55,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.05 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2500, -2.9531, 1.1406, 2.6875, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1250, -4.8750, -2.2031, 2.7188, -1.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4688, -1.0625, 3.5469, 0.9688, -3.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-4.4375, -2.7031, 0.7383, 1.2109, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7812, -0.6992, 2.3750, -0.2158, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:52:57,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.43 | optimizer_step: 0.37 [2025-11-06 18:52:57,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.76 | bwd_microstep: 1825.07 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1823.96 | step_microstep: 3.96 [2025-11-06 18:52:57,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.83 | bwd: 1825.77 | bwd_inner: 1.52 | bwd_allreduce: 1824.04 | step: 4.05 79%|███████▉ | 2763/3507 [1:08:11<22:12, 1.79s/it] {'loss': 0.275, 'learning_rate': 2.270078596364875e-06, 'epoch': 0.79} 79%|███████▉ | 2763/3507 [1:08:11<22:12, 1.79s/it]tensor([[-6.0625, -4.3438, 0.7500, 2.2188, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1875, 0.0515, 3.4375, -1.4062, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1875, -0.1289, 3.3594, -0.7461, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[ 0.4629, -0.6367, -0.6992, 3.0000, 2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:52:57,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 213.40 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.2812, -3.6719, 0.7930, 1.9531, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.0938, -0.9219, 3.4375, 3.5938, -1.8672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') 
tensor([[-5.2500, -3.2812, 0.9805, 1.4141, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1875, 0.0903, 3.3906, -1.6719, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:52:58,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:52:58,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.19 | bwd_microstep: 638.14 | bwd_inner_microstep: 1.67 | bwd_allreduce_microstep: 636.29 | step_microstep: 2.01 [2025-11-06 18:52:58,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.58 | bwd: 638.80 | bwd_inner: 2.25 | bwd_allreduce: 636.32 | step: 2.09 79%|███████▉ | 2764/3507 [1:08:12<19:31, 1.58s/it] {'loss': 0.2919, 'learning_rate': 2.264221632242515e-06, 'epoch': 0.79} 79%|███████▉ | 2764/3507 [1:08:12<19:31, 1.58s/it]tensor([[-4.7500, -4.0312, -0.3320, 2.2500, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:52:58,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.09 | bwd_microstep: 1.13 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.13 tensor([[-5.3438, -4.0000, 0.4082, 1.9062, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.3438, -2.3750, 2.9531, -0.7031, -5.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5625, -1.8672, 3.3281, 0.3027, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3438, -2.2344, 2.4531, 0.6836, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6562, -3.8906, 1.3828, 2.7812, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.2500, -5.6875, 
-1.2422, 2.0469, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5312, -2.1562, 1.5469, 0.9922, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:53:01,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.34 | optimizer_step: 0.36 [2025-11-06 18:53:01,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.77 | bwd_microstep: 3089.85 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 3088.89 | step_microstep: 3.34 [2025-11-06 18:53:01,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.87 | bwd: 3091.01 | bwd_inner: 1.77 | bwd_allreduce: 3088.99 | step: 3.46 79%|███████▉ | 2765/3507 [1:08:15<26:27, 2.14s/it] {'loss': 0.1605, 'learning_rate': 2.2583712688273706e-06, 'epoch': 0.79} 79%|███████▉ | 2765/3507 [1:08:15<26:27, 2.14s/it]tensor([[-4.5625, -0.8945, 2.4219, -1.4609, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:02,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.75 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.7500, -4.4688, -0.4297, 1.3125, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8438, -3.4688, 0.6875, 2.1250, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0625, -0.6992, 3.4688, -1.4219, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3438, -0.4355, 3.2188, -0.5391, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.8516, -0.5234, 1.5547, 2.0312, -0.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9375, -4.6250, -0.7305, 2.4219, 
-2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -1.1172, 2.9375, -1.0781, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:53:02,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.99 | optimizer_gradients: 0.19 | optimizer_step: 0.27
[2025-11-06 18:53:02,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.48 | bwd_microstep: 227.88 | bwd_inner_microstep: 2.89 | bwd_allreduce_microstep: 224.82 | step_microstep: 3.31
[2025-11-06 18:53:02,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 249.26 | bwd: 228.83 | bwd_inner: 3.79 | bwd_allreduce: 224.86 | step: 3.38
 79%|███████▉ | 2766/3507 [1:08:16<20:24, 1.65s/it] {'loss': 0.1476, 'learning_rate': 2.252527511111381e-06, 'epoch': 0.79}
 79%|███████▉ | 2766/3507 [1:08:16<20:24, 1.65s/it]
[18:53:02] /github/workspace/src/video/video_reader.cc:83: ERROR opening: /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch19/coftn-FRANKLIN_INSIDER_-_MyPilgrvideoPal.mp4, No such file or directory
Warning: The cache directory for DeepSpeed Triton autotune, /root/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
Error reading /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch19/coftn-FRANKLIN_INSIDER_-_MyPilgrvideoPal.mp4...
sharegpt4v_instruct_gpt4-vision_cap100k
Traceback (most recent call last):
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 718, in __getitem__
    ret=self.video_get_item(data_item)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 604, in video_get_item
    image_list,frame_indices = self.load_video(video_path)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 582, in load_video
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/miniconda3/envs/visualquality/lib/python3.11/site-packages/decord/video_reader.py", line 57, in __init__
    raise RuntimeError("Error reading " + uri + "...")
RuntimeError: Error reading /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch19/coftn-FRANKLIN_INSIDER_-_MyPilgrvideoPal.mp4...
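The RuntimeError above comes from decord's VideoReader hitting a missing LSVQ file, yet the training loop continues to step 2767 immediately afterwards, which suggests the dataset's `__getitem__` swallows the error and falls back to another sample. A minimal, hypothetical sketch of such a guard is below; `load_video` here is a stand-in for the project's actual loader (which wraps `decord.VideoReader`), not its real implementation:

```python
import os

def load_video(video_path):
    # Stand-in for the dataset's load_video(), which wraps
    # decord.VideoReader and raises RuntimeError on unreadable files,
    # as in the traceback above.
    if not os.path.exists(video_path):
        raise RuntimeError("Error reading " + video_path + "...")
    return video_path  # placeholder for (image_list, frame_indices)

def get_item_with_fallback(paths, index, max_retries=10):
    # Try the requested sample; on a read failure, deterministically
    # move on to a neighboring index instead of crashing the worker.
    for _ in range(max_retries):
        try:
            return load_video(paths[index])
        except RuntimeError:
            index = (index + 1) % len(paths)
    raise RuntimeError("too many consecutive unreadable samples")
```

A deterministic neighbor fallback keeps epochs reproducible; a real implementation might instead resample randomly or log the bad path for later cleanup.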
tensor([[-4.6250, -2.9688, 1.5469, 2.5781, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.4688, -6.0312, -1.4062, 2.2344, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7500, -1.9062, 3.0469, -0.4316, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:02,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 319.83 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.7188, -5.6875, -0.8359, 1.9453, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0938, 0.3105, 1.5703, -0.0156, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-4.4062, -4.4062, -0.7109, 2.9844, -1.7734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8594, -3.6250, -0.0859, 3.1719, -1.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9062, -1.6719, 2.1562, 3.9531, -1.2109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:53:04,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 18:53:04,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.34 | bwd_microstep: 1859.46 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 1858.44 | step_microstep: 3.27 [2025-11-06 18:53:04,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 498.20 | bwd: 1860.42 | bwd_inner: 1.79 | bwd_allreduce: 1858.50 | step: 3.36 79%|███████▉ | 2767/3507 [1:08:18<23:10, 1.88s/it] {'loss': 0.298, 'learning_rate': 2.2466903640808444e-06, 'epoch': 0.79} 79%|███████▉ | 2767/3507 [1:08:18<23:10, 1.88s/it]tensor([[-6.9375, 
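The DeepSpeed `[Rank 0] time (ms) | ...` records dominate this log; a small parser over that pipe-separated field layout (taken directly from the lines shown, though the helper itself is a sketch) makes it easy to aggregate forward/backward/allreduce timings across steps:

```python
import re

# Matches DeepSpeed wall-clock records such as:
#   ... [Rank 0] time (ms) | fwd: 249.26 | bwd: 228.83 | ... | step: 3.38
TIME_RE = re.compile(r"time \(ms\) \| (.+)$")

def parse_time_line(line):
    """Return {metric: milliseconds} for a DeepSpeed 'time (ms)' log line,
    or None if the line is not a timing record."""
    m = TIME_RE.search(line)
    if not m:
        return None
    fields = {}
    for part in m.group(1).split(" | "):
        name, _, value = part.partition(": ")
        fields[name] = float(value)
    return fields

line = ("[2025-11-06 18:53:02,440] [INFO] [logging.py:128:log_dist] [Rank 0] "
        "time (ms) | fwd: 249.26 | bwd: 228.83 | bwd_inner: 3.79 "
        "| bwd_allreduce: 224.86 | step: 3.38")
print(parse_time_line(line)["bwd_allreduce"])  # → 224.86
```

High `bwd_allreduce` relative to `bwd_inner` (e.g. 224.86 ms vs 3.79 ms here) indicates the backward pass is dominated by gradient communication rather than compute.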
-4.0938, 2.2344, 1.4375, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3750, -3.3906, 0.8633, 3.2500, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-8.5000, -6.3438, -1.0000, -0.5078, -6.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -2.4688, 1.5938, 1.0391, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:05,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.89 | bwd_microstep: 1.44 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.17 tensor([[-4.5938, -4.0625, -0.6016, 2.2500, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3125, -4.1875, -2.8906, 1.2422, -0.7461]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0312, -5.0000, -1.4375, 2.3125, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3125, -5.0000, -3.0000, 1.0000, -1.6328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:05,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.35 | optimizer_step: 0.43 [2025-11-06 18:53:05,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.08 | bwd_microstep: 5.30 | bwd_inner_microstep: 3.19 | bwd_allreduce_microstep: 1.89 | step_microstep: 3.09 [2025-11-06 18:53:05,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.98 | bwd: 6.76 | bwd_inner: 4.47 | bwd_allreduce: 1.98 | step: 3.27 79%|███████▉ | 2768/3507 [1:08:19<17:45, 1.44s/it] {'loss': 0.2826, 'learning_rate': 2.2408598327164234e-06, 'epoch': 0.79} 79%|███████▉ | 2768/3507 [1:08:19<17:45, 1.44s/it]tensor([[-3.9531, 0.8438, 3.7812, -3.0000, -4.9688]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9062, -4.1875, 0.5547, 3.8906, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3750, -3.5000, 1.0469, 1.3750, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:05,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.20 | bwd_microstep: 2.41 | bwd_inner_microstep: 2.07 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.21 tensor([[-7.9688, -4.5625, 1.1406, -0.6172, -6.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.4375, -6.1875, -1.8594, -0.0271, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.6250, -4.2812, 1.2656, 1.5625, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0625, -5.1250, -2.0938, 1.7266, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3125, -0.7969, 3.7969, -1.5078, -5.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:53:08,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.29 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:53:08,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 231.58 | bwd_microstep: 2663.64 | bwd_inner_microstep: 1.39 | bwd_allreduce_microstep: 2662.16 | step_microstep: 3.90 [2025-11-06 18:53:08,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 400.81 | bwd: 2666.04 | bwd_inner: 3.50 | bwd_allreduce: 2662.27 | step: 4.10 79%|███████▉ | 2769/3507 [1:08:22<23:52, 1.94s/it] {'loss': 0.2343, 'learning_rate': 2.2350359219931393e-06, 'epoch': 0.79} 79%|███████▉ | 2769/3507 [1:08:22<23:52, 1.94s/it]tensor([[-4.8125, -4.0625, -0.1445, 2.5781, -2.4219]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.3125, -6.1875, -0.7422, 2.0781, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:08,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.80 | bwd_microstep: 2.28 | bwd_inner_microstep: 2.03 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.18 tensor([[-5.7188, -3.2812, 1.1562, 1.0391, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5938, -5.4688, -1.3516, 2.7344, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.8750, -3.8750, 0.8516, -0.7930, -5.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9375, 1.6094, 3.9062, -2.3750, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.9688, -3.3750, 1.5703, 0.9219, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5938, -2.4688, 1.3828, 3.3594, -1.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:08,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.36 | optimizer_step: 0.33 [2025-11-06 18:53:08,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.22 | bwd_microstep: 4.79 | bwd_inner_microstep: 2.56 | bwd_allreduce_microstep: 2.02 | step_microstep: 3.31 [2025-11-06 18:53:08,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 390.06 | bwd: 7.07 | bwd_inner: 4.61 | bwd_allreduce: 2.11 | step: 3.49 79%|███████▉ | 2770/3507 [1:08:22<18:20, 1.49s/it] {'loss': 0.6979, 'learning_rate': 2.2292186368803582e-06, 'epoch': 0.79} 79%|███████▉ | 2770/3507 [1:08:22<18:20, 1.49s/it]tensor([[-5.5938, -2.3906, 2.1094, -0.0133, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([[-5.4062, -1.2891, 3.3438, -1.1094, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')tensor([2], device='cuda:0') tensor([[-2.9375, -2.6250, -0.0115, 2.5469, -0.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:53:08,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.80 | bwd_microstep: 1.43 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.18 tensor([[-7.1250, -7.2188, -2.9219, 1.5703, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9688, -4.6875, -0.0304, 2.0312, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -5.0000, -1.4844, 2.3750, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.1562, -2.7344, -1.9766, 1.1875, -0.1357]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -3.4219, 0.3027, 1.5703, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:53:11,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 18:53:11,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.85 | bwd_microstep: 2132.56 | bwd_inner_microstep: 2.17 | bwd_allreduce_microstep: 2130.15 | step_microstep: 2.14 [2025-11-06 18:53:11,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 370.67 | bwd: 2133.98 | bwd_inner: 3.41 | bwd_allreduce: 2130.22 | step: 2.32 79%|███████▉ | 2771/3507 [1:08:25<22:14, 1.81s/it] {'loss': 0.125, 'learning_rate': 2.223407982341793e-06, 'epoch': 0.79} 79%|███████▉ | 2771/3507 [1:08:25<22:14, 1.81s/it]tensor([[-4.8438, -4.9688, -1.0000, 3.4219, -1.8516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:1') tensor([[-1.6094, 3.0000, 4.5938, -1.7578, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0938, -2.8594, 0.0459, 3.0625, -1.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:11,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.88 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-6.5938, -6.0625, -1.6875, 1.7266, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1875, -0.6016, 3.5469, -2.1562, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6094, -0.6562, 1.6406, -0.4766, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5625, -4.5000, 0.1230, 2.6875, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6562, -2.8281, 1.1719, -0.2285, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:11,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.16 | optimizer_step: 0.21 [2025-11-06 18:53:11,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.08 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.91 | step_microstep: 3.22 [2025-11-06 18:53:11,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.00 | bwd: 2.95 | bwd_inner: 1.84 | bwd_allreduce: 0.96 | step: 3.33 79%|███████▉ | 2772/3507 [1:08:25<17:05, 1.40s/it] {'loss': 0.4834, 'learning_rate': 2.217603963335504e-06, 'epoch': 0.79} 79%|███████▉ | 2772/3507 [1:08:25<17:05, 1.40s/it]tensor([[-2.6094, -3.5469, -2.9219, 0.8867, -0.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-4.6875, 
-2.8906, 1.5938, 2.3906, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8438, -4.1875, 0.1963, 1.1875, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:12,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.91 | bwd_microstep: 0.65 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.5938, -4.8750, -0.0410, 0.8789, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7031, -3.0312, 0.4277, 2.9688, -1.6328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.1250, -2.2656, 2.1562, -1.4688, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9062, -3.5469, 1.4766, 1.5078, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0312e+00, -1.7578e+00, 2.3906e+00, 1.2817e-03, -4.3125e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:53:15,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 18:53:15,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.80 | bwd_microstep: 3213.91 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 3212.95 | step_microstep: 2.42 [2025-11-06 18:53:15,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 406.74 | bwd: 3214.56 | bwd_inner: 1.42 | bwd_allreduce: 3213.00 | step: 2.50 79%|███████▉ | 2773/3507 [1:08:29<25:22, 2.07s/it] {'loss': 0.5449, 'learning_rate': 2.2118065848138838e-06, 'epoch': 0.79} 79%|███████▉ | 2773/3507 [1:08:29<25:22, 2.07s/it]tensor([[-2.8125, 0.2871, -0.0688, -3.4219, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.0938, 
2.2500, 4.0312, -1.6406, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:15,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.62 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.3438, -4.4062, -0.7969, 3.2969, -1.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-5.7500, -2.6875, 1.5156, -0.4160, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8125, -2.1562, 2.4688, 1.7812, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.1719, 1.2109, 2.5938, -1.6797, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2500, -3.6562, 0.5742, 1.4062, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.5938, -4.0312, 1.2031, -1.0859, -6.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:53:15,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.18 [2025-11-06 18:53:15,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.64 | bwd_microstep: 125.20 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 124.37 | step_microstep: 2.09 [2025-11-06 18:53:15,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 261.27 | bwd: 126.17 | bwd_inner: 1.59 | bwd_allreduce: 124.42 | step: 2.16 79%|███████▉ | 2774/3507 [1:08:29<19:16, 1.58s/it] {'loss': 0.8641, 'learning_rate': 2.2060158517236606e-06, 'epoch': 0.79} 79%|███████▉ | 2774/3507 [1:08:29<19:16, 1.58s/it]tensor([[-5.9062, -2.9844, 1.4844, -0.0757, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.5859, 2.0312, 2.6719, -1.7578, 
-2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:53:16,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.38 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.9844, -3.0781, 0.3340, 2.4531, -1.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8438, -5.1250, -0.7383, 2.1875, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1562, -4.9062, -0.5938, 1.0547, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0938, -5.4688, -0.4980, 2.9531, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5000, -3.5781, 0.3359, 2.4531, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.9688, -3.7812, 1.8438, 0.1729, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:53:18,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 18:53:18,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 337.90 | bwd_microstep: 1827.62 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 1826.73 | step_microstep: 2.01 [2025-11-06 18:53:18,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 469.31 | bwd: 1828.57 | bwd_inner: 1.62 | bwd_allreduce: 1826.79 | step: 2.10 79%|███████▉ | 2775/3507 [1:08:32<22:02, 1.81s/it] {'loss': 0.2491, 'learning_rate': 2.200231769005895e-06, 'epoch': 0.79} 79%|███████▉ | 2775/3507 [1:08:32<22:02, 1.81s/it]tensor([[-6.3125, -4.0312, 1.1016, 1.3438, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9062, -0.7031, 1.8125, -1.3281, -3.7500]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.8125, -6.7500, -3.2188, 0.8203, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:18,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.23 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.5938, -4.1250, 0.8242, 2.2500, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8281, -2.8906, 0.3008, 1.9375, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1875, 0.4922, 3.5156, -0.2461, -3.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2188, -3.0781, 1.4766, 1.4766, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.9062, -6.1562, -0.8086, 2.6875, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:53:18,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:53:18,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.87 | bwd_microstep: 3.04 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 2.20 | step_microstep: 1.68 [2025-11-06 18:53:18,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.12 | bwd: 3.74 | bwd_inner: 1.39 | bwd_allreduce: 2.24 | step: 1.76 79%|███████▉ | 2776/3507 [1:08:32<16:58, 1.39s/it] {'loss': 0.5606, 'learning_rate': 2.1944543415959675e-06, 'epoch': 0.79} 79%|███████▉ | 2776/3507 [1:08:32<16:58, 1.39s/it]tensor([[-3.9062, -2.9219, 0.6133, 2.8125, -1.9453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3125, -4.2812, -1.3906, 2.2969, -1.7344]], device='cuda:2', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:2') tensor([[-1.8125, -1.7812, 0.6055, 3.7188, 0.0630]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:18,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.96 | bwd_microstep: 0.61 | bwd_inner_microstep: 0.51 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-8.6250, -6.7500, -2.0000, -1.1797, -6.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1875, -3.1094, 1.3906, 1.6406, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.0312, -6.0625, -1.0938, 1.7344, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.6250, -4.1562, 1.2969, 1.4062, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -1.5625, 2.4062, -0.2812, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:53:21,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.21 | optimizer_step: 0.31 [2025-11-06 18:53:21,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.34 | bwd_microstep: 2692.83 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 2691.81 | step_microstep: 2.32 [2025-11-06 18:53:21,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.34 | bwd: 2693.44 | bwd_inner: 1.43 | bwd_allreduce: 2691.85 | step: 2.39 79%|███████▉ | 2777/3507 [1:08:35<23:12, 1.91s/it] {'loss': 0.2482, 'learning_rate': 2.1886835744235913e-06, 'epoch': 0.79} 79%|███████▉ | 2777/3507 [1:08:35<23:12, 1.91s/it]tensor([[-6.5312, -5.9688, -1.2344, 2.2188, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -1.3281, 3.4062, 0.6133, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') tensor([[-4.1250, -0.6328, 1.7500, -1.8438, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.6250, -2.3125, -0.3086, 1.8203, -0.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:53:21,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.95 | bwd_microstep: 1.32 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.0625, -2.7500, 0.8438, 0.2354, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.2500, -3.5938, 2.1562, 1.6641, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9688, -2.9531, 1.3359, 1.5391, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7031, 1.4766, 4.4375, -2.8750, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:53:22,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:53:22,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.31 | bwd_microstep: 87.44 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 86.30 | step_microstep: 1.59 [2025-11-06 18:53:22,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.28 | bwd: 88.76 | bwd_inner: 2.31 | bwd_allreduce: 86.33 | step: 1.67 79%|███████▉ | 2778/3507 [1:08:36<18:07, 1.49s/it] {'loss': 0.6772, 'learning_rate': 2.18291947241278e-06, 'epoch': 0.79} 79%|███████▉ | 2778/3507 [1:08:36<18:07, 1.49s/it]tensor([[-3.2969, -3.1250, -0.1099, 3.1719, -1.0703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8594, 1.1094, 3.4688, -1.1016, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 
18:53:22,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.27 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.5625, -4.0938, 1.1094, 0.9180, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.3750, -1.0703, 2.4062, 3.7188, -0.9805]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.1562, -6.0625, -0.3770, 2.5625, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0000, -3.5781, 1.1250, 0.5039, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5000, -3.8594, -0.3438, 2.3438, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.1562, -4.7188, -0.0664, -0.5977, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:53:24,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:53:24,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.54 | bwd_microstep: 1501.61 | bwd_inner_microstep: 2.55 | bwd_allreduce_microstep: 1498.97 | step_microstep: 1.87 [2025-11-06 18:53:24,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.84 | bwd: 1502.28 | bwd_inner: 3.13 | bwd_allreduce: 1499.01 | step: 1.95 79%|███████▉ | 2779/3507 [1:08:38<19:37, 1.62s/it] {'loss': 0.2814, 'learning_rate': 2.1771620404818716e-06, 'epoch': 0.79} 79%|███████▉ | 2779/3507 [1:08:38<19:37, 1.62s/it]tensor([[-0.2451, 3.7500, 5.0625, -0.4629, -1.6797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1250, -1.9141, 1.1406, -1.7812, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4062, -2.6875, 0.6797, 
1.3516, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.2812, 0.7305, 1.6484, -1.1562, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:53:24,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.52 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.9688, -1.4688, 2.1250, 1.0625, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2500, -3.7969, 1.3750, 1.3516, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8281, -1.3594, 1.6484, 2.0781, -1.6641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4375, -4.8125, -2.0625, 2.1094, -1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:53:24,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:53:24,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.39 | bwd_microstep: 27.54 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 26.59 | step_microstep: 1.47 [2025-11-06 18:53:24,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.93 | bwd: 28.39 | bwd_inner: 1.64 | bwd_allreduce: 26.63 | step: 1.55 79%|███████▉ | 2780/3507 [1:08:38<15:13, 1.26s/it] {'loss': 0.4646, 'learning_rate': 2.1714112835435076e-06, 'epoch': 0.79} 79%|███████▉ | 2780/3507 [1:08:38<15:13, 1.26s/it]tensor([[-5.5312, -2.8438, 0.8828, -0.6094, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.0000, -6.5312, -1.4453, 0.4863, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:24,777] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 153.05 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.1875, -4.6875, -2.1406, 2.0469, -1.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1250, -1.0547, 1.9688, -0.7891, -3.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.4688, -3.4844, 0.4883, 2.4375, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9688, -2.3438, 2.9688, 0.1855, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9375, -5.0938, -1.6172, 2.7031, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6250, -2.4219, 2.0469, -2.3750, -6.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:53:26,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.30 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:53:26,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.57 | bwd_microstep: 1395.97 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 1394.68 | step_microstep: 3.33 [2025-11-06 18:53:26,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.65 | bwd: 1396.87 | bwd_inner: 2.03 | bwd_allreduce: 1394.72 | step: 3.40 79%|███████▉ | 2781/3507 [1:08:40<17:03, 1.41s/it] {'loss': 0.0977, 'learning_rate': 2.165667206504641e-06, 'epoch': 0.79} 79%|███████▉ | 2781/3507 [1:08:40<17:03, 1.41s/it]tensor([[-6.5000, -5.0000, -0.4492, 0.8320, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6875, -4.8438, -0.9727, 3.5000, -1.6953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:26,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 175.96 | bwd_microstep: 1.22 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11 tensor([[-3.3438, -0.9844, 1.9297, 0.7617, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.2656, -2.2656, -1.7891, 1.9766, 0.7734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3125, -4.0938, 0.3809, 2.4219, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.2188, -4.0938, 0.3516, 2.5781, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.1875, -2.2500, 2.7656, -1.1328, -5.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -2.6562, 1.0469, 1.5156, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:53:27,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.73 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:53:27,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.22 | bwd_microstep: 338.17 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 337.10 | step_microstep: 2.36 [2025-11-06 18:53:27,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.22 | bwd: 339.38 | bwd_inner: 2.02 | bwd_allreduce: 337.15 | step: 2.46 79%|███████▉ | 2782/3507 [1:08:40<14:37, 1.21s/it] {'loss': 0.2098, 'learning_rate': 2.159929814266517e-06, 'epoch': 0.79} 79%|███████▉ | 2782/3507 [1:08:40<14:37, 1.21s/it]tensor([[-7.0312, -3.8125, 2.0469, 0.2227, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:27,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.42 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.8438, -4.4062, 0.4980, 
2.4062, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0938, -5.9062, -1.5547, 2.5156, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.9688, -4.4688, 2.0156, 0.1357, -6.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7969, -1.5703, 1.7188, 1.4453, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.6875, -3.5000, 1.1719, 1.2422, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -2.5781, 1.4609, 2.7500, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.9062, -4.0625, 1.1953, -2.0938, -6.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:53:29,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.18 | optimizer_step: 0.28 [2025-11-06 18:53:29,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.75 | bwd_microstep: 2509.77 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 2508.70 | step_microstep: 2.32 [2025-11-06 18:53:29,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.19 | bwd: 2510.63 | bwd_inner: 1.76 | bwd_allreduce: 2508.75 | step: 2.39 79%|███████▉ | 2783/3507 [1:08:43<20:35, 1.71s/it] {'loss': 0.2865, 'learning_rate': 2.154199111724684e-06, 'epoch': 0.79} 79%|███████▉ | 2783/3507 [1:08:43<20:35, 1.71s/it]tensor([[-0.9766, 2.0156, 2.4062, -1.2969, -1.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:30,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.99 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.8750, -3.8281, -0.1670, 1.7344, -2.7812]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3750, 0.3887, 2.6250, -1.4375, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8125, -3.7812, 0.7305, 3.1719, -2.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[ 0.4453, 3.8125, 4.9375, 0.8242, -0.6641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.2500, -5.1875, -0.4473, 1.7031, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3750, -2.3750, 1.8672, 2.0781, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5000, -4.4062, 1.4141, 2.1875, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:53:30,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 18:53:30,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.91 | bwd_microstep: 223.82 | bwd_inner_microstep: 1.36 | bwd_allreduce_microstep: 222.37 | step_microstep: 1.84 [2025-11-06 18:53:30,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 271.91 | bwd: 224.76 | bwd_inner: 2.23 | bwd_allreduce: 222.40 | step: 1.92 79%|███████▉ | 2784/3507 [1:08:44<16:18, 1.35s/it] {'loss': 0.3172, 'learning_rate': 2.148475103768969e-06, 'epoch': 0.79} 79%|███████▉ | 2784/3507 [1:08:44<16:18, 1.35s/it]tensor([[-6.1875, -4.6562, -0.3008, 0.9102, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:30,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.52 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.3438, -4.8125, -0.5664, 2.7812, -2.6094]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3281, 1.2422, 2.1875, -3.9688, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5625, -1.0938, 3.0469, -0.3535, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1250, -5.0625, -1.6719, 2.0156, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5312, -5.2500, -2.9531, 1.7578, -1.5391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.6875, -4.2812, 1.6641, 1.6641, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4844, -1.1250, 1.1094, -0.5430, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:53:34,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.21 | optimizer_step: 0.23 [2025-11-06 18:53:34,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.74 | bwd_microstep: 3237.48 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 3236.12 | step_microstep: 2.35 [2025-11-06 18:53:34,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 259.26 | bwd: 3238.29 | bwd_inner: 2.00 | bwd_allreduce: 3236.16 | step: 2.42 79%|███████▉ | 2785/3507 [1:08:47<24:08, 2.01s/it] {'loss': 0.3712, 'learning_rate': 2.1427577952835044e-06, 'epoch': 0.79} 79%|███████▉ | 2785/3507 [1:08:47<24:08, 2.01s/it]tensor([[-7.7812, -5.6562, -0.9688, -0.5508, -5.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8125, -2.1562, 3.1250, 0.1533, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2188, -5.3125, -0.0630, 3.0469, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:34,257] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.00 | bwd_microstep: 2.99 | bwd_inner_microstep: 2.83 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-6.3125, -2.2344, 3.3750, -0.4453, -5.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0938, -4.6250, 0.0236, 4.0000, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0000, -2.4531, 2.2500, 1.3047, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.5625, -6.3750, -1.8047, 2.3125, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8750, 0.0493, 2.7969, -1.7656, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:53:34,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.39 | optimizer_step: 0.20 [2025-11-06 18:53:34,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.68 | bwd_microstep: 250.67 | bwd_inner_microstep: 5.61 | bwd_allreduce_microstep: 244.94 | step_microstep: 2.41 [2025-11-06 18:53:34,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.72 | bwd: 253.66 | bwd_inner: 8.46 | bwd_allreduce: 245.01 | step: 2.50 79%|███████▉ | 2786/3507 [1:08:48<19:12, 1.60s/it] {'loss': 0.1368, 'learning_rate': 2.137047191146696e-06, 'epoch': 0.79} 79%|███████▉ | 2786/3507 [1:08:48<19:12, 1.60s/it]tensor([[-3.6875, -0.4570, 3.4531, 0.5430, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -4.0000, -0.9453, 1.9453, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:34,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.38 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | 
step_microstep: 0.08 tensor([[-5.5625, -2.3750, 2.3594, 0.4395, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8750, -5.1250, -1.4219, 3.1719, -1.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.4512, 2.2812, 5.2500, 2.6719, -0.8047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.5625, -4.5312, 1.1094, -0.1895, -5.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.4062, -3.1719, 2.0000, -0.0825, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.0625, 0.1924, 2.5469, -0.8008, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:53:35,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:53:35,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 120.50 | bwd_microstep: 765.00 | bwd_inner_microstep: 6.35 | bwd_allreduce_microstep: 758.52 | step_microstep: 2.35 [2025-11-06 18:53:35,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.90 | bwd: 765.77 | bwd_inner: 7.02 | bwd_allreduce: 758.55 | step: 2.44 79%|███████▉ | 2787/3507 [1:08:49<17:23, 1.45s/it] {'loss': 0.4392, 'learning_rate': 2.1313432962312287e-06, 'epoch': 0.79} 79%|███████▉ | 2787/3507 [1:08:49<17:23, 1.45s/it]tensor([[-5.2812, -4.8750, -0.5273, 3.3281, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -5.1562, -1.5547, 2.7344, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.8125, -4.1562, 0.4805, -0.0688, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:53:36,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 197.66 | bwd_microstep: 9.67 | bwd_inner_microstep: 9.53 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.7188, -0.3691, 2.4062, -0.7852, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.3594, 1.0312, 2.2188, -1.6875, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7188, -4.6875, -0.0854, 2.2031, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4062, -4.4062, -1.3125, 2.3906, -1.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.2812, -2.5781, -0.8086, 2.6094, -0.1826]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:53:36,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.66 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 18:53:36,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.69 | bwd_microstep: 704.74 | bwd_inner_microstep: 5.15 | bwd_allreduce_microstep: 699.49 | step_microstep: 2.48 [2025-11-06 18:53:36,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.38 | bwd: 714.41 | bwd_inner: 14.69 | bwd_allreduce: 699.56 | step: 2.58 79%|███████▉ | 2788/3507 [1:08:50<16:15, 1.36s/it] {'loss': 0.1441, 'learning_rate': 2.1256461154040653e-06, 'epoch': 0.79} 79%|███████▉ | 2788/3507 [1:08:50<16:15, 1.36s/it]tensor([[-3.2188, -3.9844, -2.5000, 1.8359, -0.5859]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8438, -4.0625, -0.3613, 1.7109, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:37,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.75 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.3594, 2.0625, 
3.5625, -2.5625, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -4.4688, -1.4453, 1.4141, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6250, -1.7031, 2.1250, 0.1475, -3.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7188, -4.5625, 0.1973, 2.5312, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6250, -3.5312, -1.1172, 3.9375, 0.1270]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.7812, -4.2500, 1.7891, 1.7422, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:53:37,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.20 | optimizer_step: 0.21 [2025-11-06 18:53:37,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 115.88 | bwd_microstep: 285.98 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 284.81 | step_microstep: 1.94 [2025-11-06 18:53:37,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.65 | bwd: 286.89 | bwd_inner: 1.86 | bwd_allreduce: 284.86 | step: 2.03 80%|███████▉ | 2789/3507 [1:08:51<13:36, 1.14s/it] {'loss': 0.1903, 'learning_rate': 2.11995565352644e-06, 'epoch': 0.8} 80%|███████▉ | 2789/3507 [1:08:51<13:36, 1.14s/it]tensor([[-1.8359, 2.4688, 3.5781, -2.5156, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.6250, -2.2969, 1.9062, 5.7812, -0.2441]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:37,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.24 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.8906, 1.3359, 2.9375, -2.5938, -3.8125]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -4.3125, -0.1211, 2.7344, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.6562, -3.1250, 2.0312, -0.6250, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4688, -0.7109, 1.5156, -0.3789, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-8.5625, -5.4375, 0.7969, -0.4648, -6.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1875, -4.6875, -0.8516, 2.3906, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:53:39,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.24 | optimizer_step: 0.35 [2025-11-06 18:53:39,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.00 | bwd_microstep: 1400.59 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1399.50 | step_microstep: 2.89 [2025-11-06 18:53:39,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.25 | bwd: 1401.56 | bwd_inner: 1.83 | bwd_allreduce: 1399.56 | step: 2.98 80%|███████▉ | 2790/3507 [1:08:53<16:38, 1.39s/it] {'loss': 0.4259, 'learning_rate': 2.1142719154538526e-06, 'epoch': 0.8} 80%|███████▉ | 2790/3507 [1:08:53<16:38, 1.39s/it]tensor([[-5.6250, -3.4688, 0.5234, 0.2256, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:39,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.09 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.3750, -0.5234, 3.6875, 2.4531, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.0938, -6.8125, -2.2656, 1.6797, -3.7969]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.4375, -5.5312, -1.3047, 3.1875, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [h264 @ 0x86e3bc0] mmco: unref short failure [h264 @ 0x86e3bc0] mmco: unref short failure tensor([[-6.3750, -6.0625, -1.7812, 2.0938, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3750, 0.6641, 3.7188, -1.3906, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2500, -5.5312, -1.2266, 1.7031, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.5625, -6.7812, -2.0938, 0.9453, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:53:41,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.21 | optimizer_step: 0.18 [2025-11-06 18:53:41,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.73 | bwd_microstep: 1111.57 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1110.46 | step_microstep: 1.96 [2025-11-06 18:53:41,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.83 | bwd: 1112.41 | bwd_inner: 1.73 | bwd_allreduce: 1110.51 | step: 2.05 80%|███████▉ | 2791/3507 [1:08:54<16:55, 1.42s/it] {'loss': 0.1339, 'learning_rate': 2.1085949060360654e-06, 'epoch': 0.8} 80%|███████▉ | 2791/3507 [1:08:54<16:55, 1.42s/it]tensor([[-5.6875, -5.6562, -1.4766, 2.8281, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[ 0.7578, 4.5938, 5.1875, 0.0918, -0.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1875, -5.0625, -1.5703, 0.0270, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:41,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 167.46 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.4688, -5.0312, 0.6445, 2.7031, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.1875, -3.8906, 1.2656, 1.3438, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7500, -3.3750, 0.1846, 1.7734, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.8750, -3.3281, 1.1719, 2.6250, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5000, -6.9062, -3.8750, 0.8125, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:53:42,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:53:42,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.47 | bwd_microstep: 1241.49 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 1240.47 | step_microstep: 2.18 [2025-11-06 18:53:42,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.96 | bwd: 1242.31 | bwd_inner: 1.67 | bwd_allreduce: 1240.52 | step: 2.26 80%|███████▉ | 2792/3507 [1:08:56<17:43, 1.49s/it] {'loss': 0.6465, 'learning_rate': 2.102924630117097e-06, 'epoch': 0.8} 80%|███████▉ | 2792/3507 [1:08:56<17:43, 1.49s/it]tensor([[-6.0312, -4.8125, 0.4590, 2.9375, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6562, -1.2891, 2.6094, 1.6641, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5625, -3.7969, 1.0469, 2.1250, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5000, -5.5312, -1.6875, 2.5781, -2.4375]], device='cuda:2', dtype=torch.bfloat16, 
grad_fn=)[2025-11-06 18:53:42,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.45 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([3], device='cuda:2') tensor([[-2.0469, 1.7031, 2.0000, -2.8438, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-2.8750, 1.1562, 4.4688, -0.1514, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9062, -3.4375, 2.3438, 2.2188, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6250, 1.1797, 3.9531, -2.8125, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:53:43,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:53:43,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.77 | bwd_microstep: 1.69 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.79 | step_microstep: 4.94 [2025-11-06 18:53:43,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 433.25 | bwd: 2.34 | bwd_inner: 1.38 | bwd_allreduce: 0.82 | step: 5.02 80%|███████▉ | 2793/3507 [1:08:56<14:07, 1.19s/it] {'loss': 0.6413, 'learning_rate': 2.09726109253523e-06, 'epoch': 0.8} 80%|███████▉ | 2793/3507 [1:08:56<14:07, 1.19s/it]tensor([[-3.7656, -0.8594, 1.7891, -0.4570, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4688, -1.8516, 2.7969, -0.2158, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:43,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.77 | bwd_microstep: 0.64 | bwd_inner_microstep: 0.54 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.7188, -4.5938, 0.0126, 2.0469, -3.3438]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8750, -4.1875, 0.3047, 1.6562, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9688, -4.9375, -0.1167, 2.5000, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7812, 0.0053, 4.3125, -1.6016, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.8125, 1.7891, 4.0000, -1.9922, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.1250, -4.0625, 1.7578, 2.5156, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:53:45,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 18:53:45,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.95 | bwd_microstep: 1229.71 | bwd_inner_microstep: 1.97 | bwd_allreduce_microstep: 1227.63 | step_microstep: 2.21 [2025-11-06 18:53:45,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 430.77 | bwd: 1230.35 | bwd_inner: 2.53 | bwd_allreduce: 1227.67 | step: 2.28 80%|███████▉ | 2794/3507 [1:08:59<17:10, 1.45s/it] {'loss': 0.1473, 'learning_rate': 2.09160429812299e-06, 'epoch': 0.8} 80%|███████▉ | 2794/3507 [1:08:59<17:10, 1.45s/it]tensor([[-7.8125, -5.2500, 1.0078, 1.0859, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.6562, -5.4062, -0.4707, 1.7422, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:45,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 333.25 | bwd_microstep: 1.86 | bwd_inner_microstep: 1.57 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.18 tensor([[-5.2188, -3.2031, 0.8906, 1.2188, -3.5469]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8438, -2.1562, 2.6250, -0.6016, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.1719, 0.6055, 3.1719, -1.3203, -3.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4375, -4.2500, -2.3906, 1.8906, -0.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -3.6250, 0.6172, 1.6250, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7188, -4.9062, -1.5938, 2.7031, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:53:46,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.21 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 18:53:46,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 299.58 | bwd_microstep: 656.84 | bwd_inner_microstep: 2.48 | bwd_allreduce_microstep: 654.17 | step_microstep: 3.54 [2025-11-06 18:53:46,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 632.94 | bwd: 658.70 | bwd_inner: 4.12 | bwd_allreduce: 654.25 | step: 3.74 80%|███████▉ | 2795/3507 [1:09:00<16:51, 1.42s/it] {'loss': 0.2451, 'learning_rate': 2.0859542517071452e-06, 'epoch': 0.8} 80%|███████▉ | 2795/3507 [1:09:00<16:51, 1.42s/it]tensor([[-4.7500, -5.1250, -2.2188, 2.1406, -1.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:46,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.17 | bwd_microstep: 1.34 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.1562, -0.2129, 2.8125, -2.0469, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, -4.0625, -0.3906, 2.5781, -2.1562]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:3') tensor([[-7.4062, -5.0000, 1.1562, 1.4531, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.9688, -5.5625, 0.3438, 2.6406, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0156, 1.8828, 2.9531, -1.9531, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5625, -1.6094, 2.4688, -1.2969, -5.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2812, -2.2969, 1.4766, -0.9570, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:53:48,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.31 | optimizer_step: 0.42 [2025-11-06 18:53:48,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.10 | bwd_microstep: 1621.28 | bwd_inner_microstep: 2.36 | bwd_allreduce_microstep: 1618.78 | step_microstep: 3.25 [2025-11-06 18:53:48,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 305.29 | bwd: 1622.61 | bwd_inner: 3.57 | bwd_allreduce: 1618.85 | step: 3.34 80%|███████▉ | 2796/3507 [1:09:02<19:55, 1.68s/it] {'loss': 0.4625, 'learning_rate': 2.080310958108709e-06, 'epoch': 0.8} 80%|███████▉ | 2796/3507 [1:09:02<19:55, 1.68s/it]tensor([[-5.3125, -4.2500, -0.5430, 1.1484, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:49,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.60 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.6250, -4.8750, 0.4277, 2.0156, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -2.9844, -0.0500, 0.4512, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:3') tensor([[-5.2500, -3.0938, 0.8555, 0.3535, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9375, -1.6953, 2.9219, 0.6406, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1562, -3.1875, 0.5312, 0.5508, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.5312, -5.0000, -0.5234, 0.8984, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.6094, 2.0781, 3.0312, -1.2734, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:53:49,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.10 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:53:49,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.45 | bwd_microstep: 98.42 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 97.38 | step_microstep: 2.89 [2025-11-06 18:53:49,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.06 | bwd: 99.34 | bwd_inner: 1.77 | bwd_allreduce: 97.42 | step: 2.98 80%|███████▉ | 2797/3507 [1:09:03<15:39, 1.32s/it] {'loss': 0.6306, 'learning_rate': 2.0746744221429393e-06, 'epoch': 0.8} 80%|███████▉ | 2797/3507 [1:09:03<15:39, 1.32s/it]tensor([[-6.6562, -6.0000, -1.1016, 2.2656, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:53:49,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.71 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.2656, 0.8516, 3.3750, -1.8516, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5000, 1.5703, 3.0312, -1.9922, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.8125, 
-1.1328, 3.7031, -1.7656, -5.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9375, -4.8750, -1.7578, 1.7109, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0625, -1.1328, 3.2188, -2.9688, -6.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5625, -3.6094, 0.5938, 1.0547, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.3438, -4.1250, 0.6523, 0.7266, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:53:51,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.87 | optimizer_gradients: 0.21 | optimizer_step: 0.24 [2025-11-06 18:53:51,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.91 | bwd_microstep: 1540.47 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 1539.47 | step_microstep: 3.36 [2025-11-06 18:53:51,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 384.65 | bwd: 1541.34 | bwd_inner: 1.67 | bwd_allreduce: 1539.52 | step: 3.43 80%|███████▉ | 2798/3507 [1:09:05<17:56, 1.52s/it] {'loss': 0.207, 'learning_rate': 2.0690446486193227e-06, 'epoch': 0.8} 80%|███████▉ | 2798/3507 [1:09:05<17:56, 1.52s/it]tensor([[-5.9062, -1.4375, 3.4375, -1.3281, -5.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3281, 0.0879, 3.0156, -0.6172, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:53:51,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.01 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-7.7500, -6.7500, -2.2188, 0.1963, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3438, -0.2793, 4.0625, 0.0928, 
[Per-rank debug prints elided: each forward pass emits, on cuda:0-cuda:3, a 1x5 bfloat16 logits tensor followed by its integer label. The grad_fn class name (the text inside the angle brackets after grad_fn=) was stripped when the log was captured. A representative pair:]
tensor([[-4.3125, -2.8750, 0.7695, 1.7031, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')

[Representative DeepSpeed Rank-0 timing block, repeated at each optimizer step:]
[2025-11-06 18:53:51,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.16 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:53:51,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.82 | bwd_microstep: 157.82 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 156.69 | step_microstep: 3.20
[2025-11-06 18:53:51,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 304.83 | bwd: 158.71 | bwd_inner: 1.84 | bwd_allreduce: 156.73 | step: 3.30

[Per-step training summaries, steps 2799-2810 (duplicate progress-bar redraws removed):]
80%|███████▉ | 2799/3507 [1:09:05<14:18, 1.21s/it] {'loss': 0.383, 'learning_rate': 2.0634216423415766e-06, 'epoch': 0.8}
80%|███████▉ | 2800/3507 [1:09:07<16:04, 1.36s/it] {'loss': 0.3404, 'learning_rate': 2.0578054081076347e-06, 'epoch': 0.8}
80%|███████▉ | 2801/3507 [1:09:08<14:46, 1.26s/it] {'loss': 0.6692, 'learning_rate': 2.0521959507096712e-06, 'epoch': 0.8}
80%|███████▉ | 2802/3507 [1:09:10<17:41, 1.51s/it] {'loss': 0.0796, 'learning_rate': 2.046593274934062e-06, 'epoch': 0.8}
80%|███████▉ | 2803/3507 [1:09:11<14:20, 1.22s/it] {'loss': 0.6737, 'learning_rate': 2.040997385561405e-06, 'epoch': 0.8}
80%|███████▉ | 2804/3507 [1:09:12<16:44, 1.43s/it] {'loss': 0.5278, 'learning_rate': 2.0354082873665015e-06, 'epoch': 0.8}
80%|███████▉ | 2805/3507 [1:09:13<13:40, 1.17s/it] {'loss': 0.6476, 'learning_rate': 2.0298259851183633e-06, 'epoch': 0.8}
80%|████████ | 2806/3507 [1:09:15<15:23, 1.32s/it] {'loss': 0.4029, 'learning_rate': 2.0242504835802e-06, 'epoch': 0.8}
80%|████████ | 2807/3507 [1:09:16<16:45, 1.44s/it] {'loss': 0.3605, 'learning_rate': 2.01868178750942e-06, 'epoch': 0.8}
80%|████████ | 2808/3507 [1:09:17<15:17, 1.31s/it] {'loss': 0.6976, 'learning_rate': 2.013119901657624e-06, 'epoch': 0.8}
80%|████████ | 2809/3507 [1:09:19<16:20, 1.41s/it] {'loss': 0.3881, 'learning_rate': 2.0075648307705986e-06, 'epoch': 0.8}
80%|████████ | 2810/3507 [1:09:21<17:03, 1.47s/it] {'loss': 0.8939, 'learning_rate': 2.0020165795883285e-06, 'epoch': 0.8}
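The per-step `{'loss': ..., 'learning_rate': ..., 'epoch': ...}` dicts interleaved with the tqdm output above can be scraped from the raw stdout with a small regex. This is a sketch, not part of the training code; `parse_steps` and `STEP_RE` are hypothetical names, and the pattern assumes the HF-Trainer-style format seen in this log:

```python
import re

# Matches "<step>/<total> [<elapsed><remaining, rate>] {'loss': ..., 'learning_rate': ..., 'epoch': ...}"
# as printed after each tqdm progress-bar update in the log above.
STEP_RE = re.compile(
    r"(?P<step>\d+)/(?P<total>\d+) \[[^\]]+\]\s*"
    r"\{'loss': (?P<loss>[\d.]+), 'learning_rate': (?P<lr>[\deE.+-]+), 'epoch': (?P<epoch>[\d.]+)\}"
)

def parse_steps(log_text):
    """Return (step, loss, learning_rate) tuples found in a raw training log."""
    return [
        (int(m.group("step")), float(m.group("loss")), float(m.group("lr")))
        for m in STEP_RE.finditer(log_text)
    ]

sample = ("80%|...| 2799/3507 [1:09:05<14:18, 1.21s/it] "
          "{'loss': 0.383, 'learning_rate': 2.0634216423415766e-06, 'epoch': 0.8}")
print(parse_steps(sample))  # -> [(2799, 0.383, 2.0634216423415766e-06)]
```

Feeding the whole log file through `parse_steps` gives a loss curve that is much easier to inspect than the interleaved stdout.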
[Per-rank logits/label debug prints and repeated Rank-0 timing lines elided; per-step training summaries, steps 2811-2818:]
80%|████████ | 2811/3507 [1:09:23<18:44, 1.62s/it] {'loss': 0.6305, 'learning_rate': 1.996475152844961e-06, 'epoch': 0.8}
80%|████████ | 2812/3507 [1:09:25<22:27, 1.94s/it] {'loss': 0.7085, 'learning_rate': 1.990940555268829e-06, 'epoch': 0.8}
80%|████████ | 2813/3507 [1:09:26<17:43, 1.53s/it] {'loss': 0.4041, 'learning_rate': 1.9854127915824427e-06, 'epoch': 0.8}
80%|████████ | 2814/3507 [1:09:28<19:50, 1.72s/it] {'loss': 0.1591, 'learning_rate': 1.979891866502476e-06, 'epoch': 0.8}
80%|████████ | 2815/3507 [1:09:29<18:18, 1.59s/it] {'loss': 0.1603, 'learning_rate': 1.9743777847397672e-06, 'epoch': 0.8}
80%|████████ | 2816/3507 [1:09:32<20:25, 1.77s/it] {'loss': 0.2365, 'learning_rate': 1.9688705509993155e-06, 'epoch': 0.8}
80%|████████ | 2817/3507 [1:09:32<17:13, 1.50s/it] {'loss': 0.3073, 'learning_rate': 1.9633701699802808e-06, 'epoch': 0.8}
80%|████████ | 2818/3507 [1:09:35<21:15, 1.85s/it] {'loss': 0.2662, 'learning_rate': 1.95787664637597e-06, 'epoch': 0.8}
tensor([[-5.4688, -3.3125, 1.7734, 2.0469, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:54:22,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.43 | optimizer_step: 0.41 [2025-11-06 18:54:22,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.35 | bwd_microstep: 252.41 | bwd_inner_microstep: 2.91 | bwd_allreduce_microstep: 249.29 | step_microstep: 4.09 [2025-11-06 18:54:22,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.20 | bwd: 254.56 | bwd_inner: 4.79 | bwd_allreduce: 249.43 | step: 4.31 80%|████████ | 2819/3507 [1:09:36<17:12, 1.50s/it] {'loss': 0.6567, 'learning_rate': 1.9523899848738435e-06, 'epoch': 0.8} 80%|████████ | 2819/3507 [1:09:36<17:12, 1.50s/it]tensor([[-4.0938, -0.2139, 2.7031, -1.7422, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.3438, -5.7812, -0.0281, 2.0469, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.0312, -4.8125, 0.7852, 1.2422, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:54:22,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.27 | bwd_microstep: 2.29 | bwd_inner_microstep: 1.90 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.23 tensor([[-3.7188, -3.3438, 0.1895, 3.3281, -1.4297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-8.8125, -6.9062, -0.5547, 1.0703, -5.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8281, -4.0312, -0.8438, 3.3438, -1.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.5469, 1.4453, 3.7344, 0.7734, -1.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5000, -4.3125, 
-0.5781, 3.0938, -1.8516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:54:24,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:54:24,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 212.50 | bwd_microstep: 1934.62 | bwd_inner_microstep: 1.81 | bwd_allreduce_microstep: 1932.64 | step_microstep: 2.18 [2025-11-06 18:54:24,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 509.79 | bwd: 1936.87 | bwd_inner: 3.74 | bwd_allreduce: 1932.75 | step: 2.42 80%|████████ | 2820/3507 [1:09:38<20:37, 1.80s/it] {'loss': 0.5323, 'learning_rate': 1.9469101901555045e-06, 'epoch': 0.8} 80%|████████ | 2820/3507 [1:09:38<20:37, 1.80s/it]tensor([[-6.4375, -4.7500, 0.3633, 1.8594, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:54:25,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 114.16 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.7500, -5.2188, 0.7188, 0.4805, -5.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -0.8125, 3.1406, -1.8750, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5312, -3.2656, 0.6133, 4.2188, -1.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8750, -0.5898, 2.5625, -0.7539, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-5.3438, -4.9375, -1.0938, 2.3750, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.7969, 1.5625, 2.7656, -1.1328, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.4688, -4.0312, 0.3105, 1.8203, -3.3594]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:54:25,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:54:25,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.46 | bwd_microstep: 99.97 | bwd_inner_microstep: 1.78 | bwd_allreduce_microstep: 98.08 | step_microstep: 2.05 [2025-11-06 18:54:25,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 250.65 | bwd: 100.84 | bwd_inner: 2.55 | bwd_allreduce: 98.11 | step: 2.14 80%|████████ | 2821/3507 [1:09:39<15:43, 1.38s/it] {'loss': 0.9163, 'learning_rate': 1.9414372668966954e-06, 'epoch': 0.8} 80%|████████ | 2821/3507 [1:09:39<15:43, 1.38s/it]tensor([[-5.1562, -1.4922, 1.9922, -1.5781, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:54:25,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.90 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.6250, -2.5000, 1.8125, 2.0312, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8438, -5.0000, -0.5547, 2.3594, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6250, -4.2188, 0.0220, 1.5859, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1250, -3.9375, -0.7891, 2.3125, -1.7891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0000, -5.9062, -1.7422, 2.4531, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0938, -5.3125, -0.5703, 2.7344, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2188, -1.7344, 1.8047, -1.3594, -4.7188]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:54:27,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:54:27,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.15 | bwd_microstep: 1823.07 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 1822.21 | step_microstep: 1.65 [2025-11-06 18:54:27,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.09 | bwd: 1824.02 | bwd_inner: 1.60 | bwd_allreduce: 1822.25 | step: 1.74 80%|████████ | 2822/3507 [1:09:41<18:39, 1.63s/it] {'loss': 0.1387, 'learning_rate': 1.9359712197672997e-06, 'epoch': 0.8} 80%|████████ | 2822/3507 [1:09:41<18:39, 1.63s/it]tensor([[-2.7031, 0.2100, 1.9453, -0.7266, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.8125, -1.5078, 2.8438, -1.8281, -5.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6562, -4.1562, -0.4785, 2.1250, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1562, -5.8750, -1.9375, 1.8281, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:54:27,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.33 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.7969, -0.0654, 3.1875, -0.8945, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2500, -1.1250, 1.9531, -0.1709, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3438, 0.9492, 3.1406, -2.5312, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.7500, -3.8906, 2.0156, -0.9453, -6.5312]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:54:27,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:54:27,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 114.07 | bwd_microstep: 97.68 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 96.51 | step_microstep: 1.63 [2025-11-06 18:54:27,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.43 | bwd: 98.74 | bwd_inner: 2.06 | bwd_allreduce: 96.55 | step: 1.72 80%|████████ | 2823/3507 [1:09:41<14:31, 1.27s/it] {'loss': 0.3105, 'learning_rate': 1.9305120534313295e-06, 'epoch': 0.8} 80%|████████ | 2823/3507 [1:09:41<14:31, 1.27s/it]tensor([[-5.2812, -5.0625, -1.5859, 2.2656, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5312, -1.7578, 2.8125, -0.6094, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -5.0625, -1.4766, 2.5469, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:54:28,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.57 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9688, -0.3477, 4.0312, -1.5234, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.0000, -4.0312, 1.3359, 2.2656, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0312, -3.9375, 0.6367, 3.0156, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.3750, -4.3438, 1.5234, 0.2734, -5.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.8281, 1.8672, 2.7188, -1.9922, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') [2025-11-06 18:54:30,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.31 | optimizer_step: 0.44 [2025-11-06 18:54:30,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 84.78 | bwd_microstep: 1922.41 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1921.25 | step_microstep: 3.12 [2025-11-06 18:54:30,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 234.37 | bwd: 1923.25 | bwd_inner: 1.80 | bwd_allreduce: 1921.30 | step: 3.20 81%|████████ | 2824/3507 [1:09:43<17:37, 1.55s/it] {'loss': 0.3035, 'learning_rate': 1.925059772546929e-06, 'epoch': 0.81} 81%|████████ | 2824/3507 [1:09:43<17:37, 1.55s/it]tensor([[-1.8750, 1.8672, 2.9531, -1.8203, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:54:30,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.83 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.4375, -3.5000, 2.4375, 1.4531, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4844, -4.0312, -1.8359, 2.1250, -0.9883]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-9.5625, -7.5938, -0.6562, 1.1562, -6.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.3438, -3.7656, 1.8047, 1.3047, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.1562, -1.4531, 1.9453, 0.3984, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7188, -4.8125, -1.5078, 2.2812, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5625, -0.0187, 2.5156, 1.1250, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 
18:54:30,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:54:30,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.39 | bwd_microstep: 156.85 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 155.64 | step_microstep: 1.74 [2025-11-06 18:54:30,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.23 | bwd: 157.84 | bwd_inner: 2.04 | bwd_allreduce: 155.67 | step: 1.81 81%|████████ | 2825/3507 [1:09:44<14:10, 1.25s/it] {'loss': 0.9518, 'learning_rate': 1.9196143817663604e-06, 'epoch': 0.81} 81%|████████ | 2825/3507 [1:09:44<14:10, 1.25s/it]tensor([[-6.0938, -3.4375, 1.7656, 1.1484, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9375, -2.4219, 1.8203, 0.8086, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5938, -5.0938, -0.3730, 1.3047, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:54:30,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.35 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.5625, -5.3750, -1.4297, 2.3906, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7188, -1.5859, 2.0781, -0.4824, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.0781, -3.6094, -1.6172, 2.3281, -0.6953]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9375, -2.2812, 1.1250, 3.7969, -0.9805]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2188, -3.9062, -0.7656, 2.1406, -1.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:54:33,436] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.27 | optimizer_step: 0.29 [2025-11-06 18:54:33,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.97 | bwd_microstep: 2353.08 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 2351.80 | step_microstep: 2.81 [2025-11-06 18:54:33,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.35 | bwd: 2353.95 | bwd_inner: 1.95 | bwd_allreduce: 2351.86 | step: 2.90 81%|████████ | 2826/3507 [1:09:47<19:17, 1.70s/it] {'loss': 0.1613, 'learning_rate': 1.9141758857360194e-06, 'epoch': 0.81} 81%|████████ | 2826/3507 [1:09:47<19:17, 1.70s/it]tensor([[-6.6562, -4.5625, 0.7734, 1.4609, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.3750, -3.0469, 1.2188, -1.0156, -5.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:54:33,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.48 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.1250, -5.4375, -0.4277, 3.0312, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8906, -3.3438, 0.3027, 3.0625, -1.7266]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6250, -2.8281, 2.6406, 1.6094, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0781, -2.8906, -1.6875, 2.5312, 0.1943]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-5.2188, -2.4688, 1.2969, 0.2100, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-8.3125, -6.3125, -0.1455, 1.1562, -5.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:54:33,929] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:54:33,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.89 | bwd_microstep: 91.20 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 90.23 | step_microstep: 1.43 [2025-11-06 18:54:33,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.39 | bwd: 92.13 | bwd_inner: 1.75 | bwd_allreduce: 90.26 | step: 1.50 81%|████████ | 2827/3507 [1:09:47<15:08, 1.34s/it] {'loss': 1.4111, 'learning_rate': 1.9087442890964102e-06, 'epoch': 0.81} 81%|████████ | 2827/3507 [1:09:47<15:08, 1.34s/it]tensor([[-2.2969, -3.2188, -1.9375, 2.2656, 0.1035]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:54:34,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.24 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.7500, -5.9688, -2.4219, 2.0156, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4062, -5.8438, -2.5938, 2.0625, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9688, -3.0781, 2.0938, 0.7109, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9375, -0.8984, 0.4414, -2.3438, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.1406, 1.4141, 3.8438, -2.5781, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.9688, -5.9688, -1.6250, 2.9844, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4844, -1.1875, 2.1875, 1.7188, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:54:36,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.65 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:54:36,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.44 | bwd_microstep: 2425.89 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 2424.93 | step_microstep: 2.45 [2025-11-06 18:54:36,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.70 | bwd: 2426.67 | bwd_inner: 1.57 | bwd_allreduce: 2424.96 | step: 2.53 81%|████████ | 2828/3507 [1:09:50<20:11, 1.78s/it] {'loss': 0.6956, 'learning_rate': 1.9033195964821438e-06, 'epoch': 0.81} 81%|████████ | 2828/3507 [1:09:50<20:11, 1.78s/it]tensor([[-6.0625, -2.6406, 2.4531, -0.0289, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0312, -5.0312, -0.9922, 3.6094, -1.8984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9844, -1.7188, 2.1719, 2.0156, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7344, 1.1797, 2.9219, -2.2031, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:54:36,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.69 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.5156, -3.7500, -0.8438, 3.2500, -0.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9062, -4.9688, -1.5469, 2.5938, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4062, -4.8438, -0.7305, 2.6719, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6250, -4.6250, -0.0918, 2.5000, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:54:37,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.95 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:54:37,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.34 | bwd_microstep: 1.72 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.66 | step_microstep: 2.39 [2025-11-06 18:54:37,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.01 | bwd: 2.73 | bwd_inner: 1.92 | bwd_allreduce: 0.69 | step: 2.47 81%|████████ | 2829/3507 [1:09:51<15:36, 1.38s/it] {'loss': 0.1349, 'learning_rate': 1.8979018125219551e-06, 'epoch': 0.81} 81%|████████ | 2829/3507 [1:09:51<15:36, 1.38s/it]tensor([[-4.8125, -3.8281, 0.0933, 2.1562, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:54:37,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.99 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2812, -0.6953, 2.6875, -0.4824, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.1250, -5.5625, -0.8828, 0.6797, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3438, -4.5000, 0.2441, 3.1250, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.7031, -3.6406, -2.0938, 2.3906, -0.1885]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6562, -1.9844, 2.3281, 1.3516, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9062, -2.1250, 1.2188, -0.3984, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6250, -2.9375, 1.1484, 1.7344, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:54:39,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.20 | 
optimizer_step: 0.31 [2025-11-06 18:54:39,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.29 | bwd_microstep: 2291.08 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 2289.98 | step_microstep: 2.22 [2025-11-06 18:54:39,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.30 | bwd: 2291.99 | bwd_inner: 1.82 | bwd_allreduce: 2290.02 | step: 2.30 81%|████████ | 2830/3507 [1:09:53<19:52, 1.76s/it] {'loss': 0.4514, 'learning_rate': 1.892490941838674e-06, 'epoch': 0.81} 81%|████████ | 2830/3507 [1:09:53<19:52, 1.76s/it]tensor([[-5.3438, -4.9688, -0.6328, 2.7969, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9688, -3.7031, -0.3828, 2.8125, -1.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:54:40,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.82 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.6562, -3.5156, -0.5820, -3.1562, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -1.9453, 1.8984, 0.1699, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9688, -4.5000, 0.0332, 1.3828, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.6250, -5.9062, -2.5781, 1.8203, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1875, -1.6797, 3.1875, 0.2090, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0000, -3.1094, 0.7148, 0.9648, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:54:40,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.13 | optimizer_step: 0.15 
[2025-11-06 18:54:40,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.00 | bwd_microstep: 89.57 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 88.51 | step_microstep: 1.53 [2025-11-06 18:54:40,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 275.82 | bwd: 90.41 | bwd_inner: 1.74 | bwd_allreduce: 88.54 | step: 1.61 81%|████████ | 2831/3507 [1:09:54<15:14, 1.35s/it] {'loss': 0.1965, 'learning_rate': 1.8870869890492328e-06, 'epoch': 0.81} 81%|████████ | 2831/3507 [1:09:54<15:14, 1.35s/it]tensor([[-5.8750, -4.3438, -0.0977, 1.0469, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2188, -0.5430, 2.5938, -0.7734, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:54:40,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.61 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.12 tensor([[-7.0625, -4.4375, 1.5234, 1.1641, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6562, -4.3125, 0.0957, 1.8281, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0000, -5.0938, 0.0149, 3.0469, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.5625, -5.4375, 0.6875, 1.6250, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.0625, -6.1250, -1.3203, 1.4531, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.3125, 0.3203, 2.5156, 0.0869, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:54:42,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.21 | optimizer_step: 0.23 [2025-11-06 18:54:42,425] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.23 | bwd_microstep: 1811.57 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1810.35 | step_microstep: 2.76 [2025-11-06 18:54:42,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.87 | bwd: 1812.64 | bwd_inner: 1.91 | bwd_allreduce: 1810.41 | step: 2.86 81%|████████ | 2832/3507 [1:09:56<18:00, 1.60s/it] {'loss': 0.6784, 'learning_rate': 1.8816899587646631e-06, 'epoch': 0.81} 81%|████████ | 2832/3507 [1:09:56<18:00, 1.60s/it]tensor([[-5.0000, -4.2500, 0.3711, 3.4062, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6562, -4.3125, -0.5859, 2.6875, -2.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5938, -3.7031, 1.0859, 3.9688, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-8.0000, -4.5000, 1.5078, -0.6367, -6.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:54:42,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.56 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12 tensor([[-6.8438, -4.1562, 1.7891, 1.3203, -4.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.6719, -0.7422, 1.6875, -0.8008, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0000, -4.7188, 0.5156, 2.8438, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7188, -3.1250, 0.8477, -0.1729, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:54:42,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:54:42,891] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 198.18 | bwd_microstep: 1.58 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.71 | step_microstep: 2.01
[2025-11-06 18:54:42,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 412.77 | bwd: 2.53 | bwd_inner: 1.61 | bwd_allreduce: 0.76 | step: 2.13
 81%|████████ | 2833/3507 [1:09:56<14:09, 1.26s/it] {'loss': 0.4588, 'learning_rate': 1.876299855590088e-06, 'epoch': 0.81}
[interleaved per-rank logits/label debug prints, duplicate tqdm redraws, and per-microstep/optimizer timer lines elided; Rank 0 per-step summaries retained]
[2025-11-06 18:54:46,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.77 | bwd: 3238.22 | bwd_inner: 1.29 | bwd_allreduce: 3236.77 | step: 2.79
 81%|████████ | 2834/3507 [1:10:00<22:09, 1.98s/it] {'loss': 0.5838, 'learning_rate': 1.8709166841247206e-06, 'epoch': 0.81}
[2025-11-06 18:54:46,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 414.84 | bwd: 2.70 | bwd_inner: 1.89 | bwd_allreduce: 0.69 | step: 1.87
 81%|████████ | 2835/3507 [1:10:00<17:01, 1.52s/it] {'loss': 0.8011, 'learning_rate': 1.865540448961859e-06, 'epoch': 0.81}
[2025-11-06 18:54:47,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.48 | bwd: 169.94 | bwd_inner: 2.17 | bwd_allreduce: 167.64 | step: 2.10
 81%|████████ | 2836/3507 [1:10:01<13:40, 1.22s/it] {'loss': 0.5683, 'learning_rate': 1.8601711546888844e-06, 'epoch': 0.81}
[2025-11-06 18:54:48,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.79 | bwd: 222.53 | bwd_inner: 1.98 | bwd_allreduce: 220.42 | step: 2.28
 81%|████████ | 2837/3507 [1:10:02<11:46, 1.05s/it] {'loss': 0.4172, 'learning_rate': 1.8548088058872504e-06, 'epoch': 0.81}
[2025-11-06 18:54:49,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 304.07 | bwd: 1245.04 | bwd_inner: 1.67 | bwd_allreduce: 1243.25 | step: 2.42
 81%|████████ | 2838/3507 [1:10:03<14:14, 1.28s/it] {'loss': 0.3628, 'learning_rate': 1.8494534071324966e-06, 'epoch': 0.81}
[2025-11-06 18:54:51,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.19 | bwd: 822.98 | bwd_inner: 1.90 | bwd_allreduce: 820.92 | step: 96.89
 81%|████████ | 2839/3507 [1:10:05<15:32, 1.40s/it] {'loss': 0.4079, 'learning_rate': 1.8441049629942164e-06, 'epoch': 0.81}
[2025-11-06 18:54:54,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 555.95 | bwd: 2551.86 | bwd_inner: 1.99 | bwd_allreduce: 2549.71 | step: 2.53
 81%|████████ | 2840/3507 [1:10:08<21:23, 1.92s/it] {'loss': 0.4785, 'learning_rate': 1.8387634780360774e-06, 'epoch': 0.81}
[2025-11-06 18:54:55,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.19 | bwd: 102.67 | bwd_inner: 1.46 | bwd_allreduce: 101.01 | step: 4.17
 81%|████████ | 2841/3507 [1:10:09<16:39, 1.50s/it] {'loss': 0.7582, 'learning_rate': 1.833428956815807e-06, 'epoch': 0.81}
[2025-11-06 18:54:56,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.27 | bwd: 4.51 | bwd_inner: 2.78 | bwd_allreduce: 1.42 | step: 3.35
 81%|████████ | 2842/3507 [1:10:10<16:24, 1.48s/it] {'loss': 0.5137, 'learning_rate': 1.8281014038851963e-06, 'epoch': 0.81}
[2025-11-06 18:54:58,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 456.83 | bwd: 1161.32 | bwd_inner: 2.53 | bwd_allreduce: 1158.51 | step: 4.26
 81%|████████ | 2843/3507 [1:10:12<17:02, 1.54s/it] {'loss': 0.1978, 'learning_rate': 1.822780823790088e-06, 'epoch': 0.81}
[2025-11-06 18:55:00,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 502.76 | bwd: 3.70 | bwd_inner: 2.22 | bwd_allreduce: 1.27 | step: 6.99
 81%|████████ | 2844/3507 [1:10:14<19:53, 1.80s/it] {'loss': 0.8673, 'learning_rate': 1.8174672210703626e-06, 'epoch': 0.81}
[2025-11-06 18:55:01,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 488.38 | bwd: 5.19 | bwd_inner: 3.91 | bwd_allreduce: 1.05 | step: 8.29
 81%|████████ | 2845/3507 [1:10:15<15:43, 1.43s/it] {'loss': 0.347, 'learning_rate': 1.8121606002599667e-06, 'epoch': 0.81}
[2025-11-06 18:55:04,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.67 | bwd: 2.23 | bwd_inner: 1.28 | bwd_allreduce: 0.81 | step: 2.36
 81%|████████ | 2846/3507 [1:10:18<21:11, 1.92s/it] {'loss': 0.6565, 'learning_rate': 1.8068609658868774e-06, 'epoch': 0.81}
[2025-11-06 18:55:05,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 309.23 | bwd: 201.73 | bwd_inner: 6.87 | bwd_allreduce: 194.73 | step: 1.83
 81%|████████ | 2847/3507 [1:10:18<16:36, 1.51s/it] {'loss': 0.6129, 'learning_rate': 1.801568322473115e-06, 'epoch': 0.81}
[2025-11-06 18:55:07,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.75 | bwd: 612.74 | bwd_inner: 2.53 | bwd_allreduce: 610.05 | step: 2.03
 81%|████████ | 2848/3507 [1:10:20<18:08, 1.65s/it] {'loss': 0.1691, 'learning_rate': 1.7962826745347318e-06, 'epoch': 0.81}
[2025-11-06 18:55:07,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.97 | bwd: 85.48 | bwd_inner: 1.36 | bwd_allreduce: 83.98 | step: 3.70
 81%|████████ | 2849/3507 [1:10:21<14:25, 1.32s/it] {'loss': 0.2563, 'learning_rate': 1.7910040265818118e-06, 'epoch': 0.81}
[2025-11-06 18:55:10,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.92 | bwd: 2290.81 | bwd_inner: 1.68 | bwd_allreduce: 2288.94 | step: 2.54
 81%|████████▏ | 2850/3507 [1:10:24<20:23, 1.86s/it] {'loss': 0.1954, 'learning_rate': 1.785732383118467e-06, 'epoch': 0.81}
[2025-11-06 18:55:11,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.76 | bwd: 2.61 | bwd_inner: 1.65 | bwd_allreduce: 0.82 | step: 8.25
 81%|████████▏ | 2851/3507 [1:10:25<15:59, 1.46s/it] {'loss': 0.5878, 'learning_rate': 1.7804677486428335e-06, 'epoch': 0.81}
[2025-11-06 18:55:13,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 397.35 | bwd: 418.33 | bwd_inner: 1.63 | bwd_allreduce: 416.52 | step: 2.25
 81%|████████▏ | 2852/3507 [1:10:27<18:50, 1.73s/it] {'loss': 0.3468, 'learning_rate': 1.7752101276470645e-06, 'epoch': 0.81}
[2025-11-06 18:55:15,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.60 | bwd: 1242.53 | bwd_inner: 1.33 | bwd_allreduce: 1241.06 | step: 2.26
 81%|████████▏ | 2853/3507 [1:10:29<19:24, 1.78s/it] {'loss': 0.4344, 'learning_rate': 1.7699595246173285e-06, 'epoch': 0.81}
[2025-11-06 18:55:17,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.69 | bwd: 2020.97 | bwd_inner: 1.44 | bwd_allreduce: 2019.37 | step: 1.63
 81%|████████▏ | 2854/3507 [1:10:31<21:31, 1.98s/it] {'loss': 0.8082, 'learning_rate': 1.7647159440338136e-06, 'epoch': 0.81}
 81%|████████▏ | 2854/3507 [1:10:31<21:31,
1.98s/it]tensor([[-2.7500, -3.4688, -2.1562, 1.6406, -0.3711]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5938, -3.8125, 0.0449, 2.3438, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5312, -0.3281, 2.7812, -2.0000, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:55:18,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.66 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9688, -1.5625, 2.7031, -0.1226, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7500, -4.3750, 0.7656, 3.1562, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.6250, -2.6094, -0.2002, 3.4219, -0.4023]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.3750, -4.7812, -0.3066, 1.0547, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0000, -2.5156, 1.7891, 1.0781, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:55:19,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:55:19,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.09 | bwd_microstep: 665.70 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 664.51 | step_microstep: 1.86 [2025-11-06 18:55:19,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.77 | bwd: 666.64 | bwd_inner: 1.96 | bwd_allreduce: 664.55 | step: 1.94 81%|████████▏ | 2855/3507 [1:10:33<19:55, 1.83s/it] {'loss': 0.447, 'learning_rate': 1.759479390370703e-06, 'epoch': 0.81} 81%|████████▏ | 2855/3507 [1:10:33<19:55, 1.83s/it]tensor([[-1.6875, 
-2.5938, -2.0469, 1.7266, 0.4609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -4.3750, 0.1484, 2.4844, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -1.0234, 2.6406, 0.5898, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:55:19,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.30 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.1562, -1.6719, 3.0625, 2.8594, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -2.9531, 0.3711, 1.7812, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.4219, -0.2373, 2.7812, 0.1128, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8281, 1.0312, 2.6875, -2.0312, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.9062, -0.9414, 3.9531, -2.1094, -6.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:55:20,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.21 | optimizer_step: 0.19 [2025-11-06 18:55:20,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.58 | bwd_microstep: 1063.43 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1062.28 | step_microstep: 2.08 [2025-11-06 18:55:20,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.89 | bwd: 1064.43 | bwd_inner: 1.96 | bwd_allreduce: 1062.33 | step: 2.17 81%|████████▏ | 2856/3507 [1:10:34<18:36, 1.71s/it] {'loss': 0.4194, 'learning_rate': 1.7542498680961917e-06, 'epoch': 0.81} 81%|████████▏ | 2856/3507 [1:10:34<18:36, 1.71s/it]tensor([[-2.3594, 1.5391, 3.2812, -1.8828, 
-3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:55:20,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.74 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.0312, -0.3359, 3.3281, -0.3789, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5156, -1.7969, 0.7852, 0.8906, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -1.2188, 3.7656, -0.3086, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9531, -1.0625, 2.8125, 0.7578, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2969, -3.0781, -2.5000, 0.9297, -0.1533]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-7.9062, -5.2188, 1.0391, 0.7305, -5.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -4.4062, -0.2520, 3.3750, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:55:21,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:55:21,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.99 | bwd_microstep: 176.76 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 175.61 | step_microstep: 2.08 [2025-11-06 18:55:21,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 270.73 | bwd: 177.78 | bwd_inner: 1.97 | bwd_allreduce: 175.65 | step: 2.17 81%|████████▏ | 2857/3507 [1:10:35<14:34, 1.35s/it] {'loss': 0.3886, 'learning_rate': 1.7490273816724734e-06, 'epoch': 0.81} 81%|████████▏ | 2857/3507 [1:10:35<14:34, 1.35s/it]tensor([[-2.0469, 2.4531, 4.1250, -1.8281, -3.2344]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:55:21,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 101.54 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.8750, -4.1875, 0.3535, 3.4375, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0781, -0.0850, 2.8750, 0.2207, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0000, -3.0625, 2.3281, 1.2969, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9688, -3.0000, 2.6406, 1.5078, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6250, -1.8047, 2.1875, 0.4375, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.2812, -6.6562, -1.8750, 1.6016, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.7891, 1.7188, 3.3125, -0.6719, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:55:21,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:55:21,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.56 | bwd_microstep: 274.32 | bwd_inner_microstep: 1.28 | bwd_allreduce_microstep: 272.96 | step_microstep: 1.54 [2025-11-06 18:55:21,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 208.11 | bwd: 275.28 | bwd_inner: 2.17 | bwd_allreduce: 272.99 | step: 1.61 81%|████████▏ | 2858/3507 [1:10:35<11:50, 1.09s/it] {'loss': 0.5085, 'learning_rate': 1.7438119355557425e-06, 'epoch': 0.81} 81%|████████▏ | 2858/3507 [1:10:35<11:50, 1.09s/it]tensor([[-2.3594, 1.3438, 4.3125, 0.3457, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:2') tensor([[-6.8750, -2.2500, 2.8594, -2.2500, -6.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0625, -2.2188, 3.1562, 0.0371, -5.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:55:22,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.91 | bwd_microstep: 1.28 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.6562, -4.5938, -2.8906, 1.8516, -0.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -2.3125, 1.1016, 0.0564, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0000, -4.1250, 1.4453, 2.6562, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1562, -4.3438, 0.1992, 3.0781, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4062, -1.5547, 3.1094, -0.7383, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:55:25,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.18 | optimizer_step: 0.30 [2025-11-06 18:55:25,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 284.54 | bwd_microstep: 2268.50 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 2267.19 | step_microstep: 2.26 [2025-11-06 18:55:25,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 508.48 | bwd: 2269.78 | bwd_inner: 2.38 | bwd_allreduce: 2267.24 | step: 2.33 82%|████████▏ | 2859/3507 [1:10:38<18:37, 1.72s/it] {'loss': 0.2425, 'learning_rate': 1.7386035341961805e-06, 'epoch': 0.82} 82%|████████▏ | 2859/3507 [1:10:38<18:37, 1.72s/it]tensor([[-5.2812, -3.8438, 1.0625, 3.0000, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:0') [2025-11-06 18:55:25,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.56 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.1406, -3.3594, -0.4863, 3.5312, -0.6680]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.0938, -2.7969, -0.2266, 4.5000, 0.4160]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0000, -3.0625, 1.7578, 0.3750, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0000, -4.7500, -0.8047, 3.0000, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.3125, -5.5000, -1.1016, 1.8438, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9062, -1.6250, 3.9062, -0.4043, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4062, -2.1250, 3.2031, 1.0312, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:55:25,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.23 | optimizer_step: 0.20 [2025-11-06 18:55:25,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.02 | bwd_microstep: 153.58 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 152.53 | step_microstep: 2.06 [2025-11-06 18:55:25,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.60 | bwd: 154.64 | bwd_inner: 1.92 | bwd_allreduce: 152.58 | step: 2.15 82%|████████▏ | 2860/3507 [1:10:39<14:48, 1.37s/it] {'loss': 0.8268, 'learning_rate': 1.7334021820379588e-06, 'epoch': 0.82} 82%|████████▏ | 2860/3507 [1:10:39<14:48, 1.37s/it]tensor([[-3.5312, -2.3438, 0.5547, 1.8594, -1.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') 
[2025-11-06 18:55:25,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.14 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-5.7812, -1.6328, 3.4844, -0.5938, -5.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5625, -3.5156, 1.2812, 1.9453, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.6406, 0.5859, 0.3555, -1.5859, -1.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-6.4375, -3.8594, 1.3594, 0.8789, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [h264 @ 0xc3cfc80] top block unavailable for requested intra mode -1 [h264 @ 0xc3cfc80] error while decoding MB 9 0, bytestream 17940 [h264 @ 0xc40de40] top block unavailable for requested intra mode -1 [h264 @ 0xc40de40] error while decoding MB 9 0, bytestream 17940 [h264 @ 0xc40de40] gray chroma [h264 @ 0xc40de40] error while decoding MB 10 11, bytestream 348 [h264 @ 0xc40de40] error while decoding MB 10 8, bytestream -27 [h264 @ 0xc40de40] error while decoding MB 2 6, bytestream -55 [h264 @ 0xc40de40] Reference -1 >= 15 [h264 @ 0xc40de40] error while decoding MB 12 6, bytestream 1079 [h264 @ 0xc40de40] Reference -1 >= 16 [h264 @ 0xc40de40] error while decoding MB 5 4, bytestream 1422 [h264 @ 0xc40de40] error while decoding MB 14 3, bytestream -21 [h264 @ 0xc40de40] Reference 24 >= 15 [h264 @ 0xc40de40] error while decoding MB 11 2, bytestream 968 [h264 @ 0xc40de40] error while decoding MB 0 13, bytestream -16 [h264 @ 0xc40de40] Reference 16 >= 16 [h264 @ 0xc40de40] error while decoding MB 9 8, bytestream 693 [h264 @ 0xc40de40] cabac decode of qscale diff failed at 10 2 [h264 @ 0xc40de40] error while decoding MB 10 2, bytestream 3051 [h264 @ 0xc40de40] Reference -1 >= 15 [h264 @ 0xc40de40] error while decoding MB 13 7, bytestream 1631 [h264 
@ 0xc40de40] Reference -1 >= 15 [h264 @ 0xc40de40] error while decoding MB 3 7, bytestream 414 tensor([[-4.9688, -3.6406, 0.4512, 2.2031, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.1250, -3.3281, 0.8398, -0.7578, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2188, -3.1406, 0.8477, 0.6914, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:55:27,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 18:55:27,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.91 | bwd_microstep: 427.93 | bwd_inner_microstep: 5.27 | bwd_allreduce_microstep: 422.57 | step_microstep: 7.11 [2025-11-06 18:55:27,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.06 | bwd: 428.90 | bwd_inner: 6.13 | bwd_allreduce: 422.61 | step: 7.22 82%|████████▏ | 2861/3507 [1:10:41<17:01, 1.58s/it] {'loss': 0.5128, 'learning_rate': 1.7282078835192362e-06, 'epoch': 0.82} 82%|████████▏ | 2861/3507 [1:10:41<17:01, 1.58s/it]tensor([[-3.7188, -0.0068, 2.3125, -1.9688, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0938, -2.3281, 1.1484, -0.4570, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5312, -1.8828, 1.5938, -0.3789, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -4.7812, -1.6094, 1.4531, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:55:27,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.61 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-4.3750, -0.9492, 1.8438, -1.3594, -4.1250]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7812, -4.7188, -1.0000, 2.9062, -1.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.4297, 1.7031, 1.8750, -1.5469, -1.9922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.0312, -4.5312, -0.9258, 1.9453, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:55:28,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.21 | optimizer_step: 0.19 [2025-11-06 18:55:28,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.57 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.90 | step_microstep: 2.33 [2025-11-06 18:55:28,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 424.21 | bwd: 2.95 | bwd_inner: 1.81 | bwd_allreduce: 0.95 | step: 2.44 82%|████████▏ | 2862/3507 [1:10:41<13:26, 1.25s/it] {'loss': 0.3984, 'learning_rate': 1.7230206430721508e-06, 'epoch': 0.82} 82%|████████▏ | 2862/3507 [1:10:41<13:26, 1.25s/it]tensor([[-5.2812, -4.5000, -0.3457, 2.4219, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.4375, -5.9375, -0.8867, 0.7656, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.9375, -5.0625, 0.8516, 2.1562, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -3.1719, 1.4453, 2.1406, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:55:28,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.67 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.9688, -3.7031, -0.7891, 2.2969, -1.6562]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -3.7656, -0.1147, 3.2812, -1.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.0625, -7.3438, -2.4219, 0.9883, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.2812, -4.7500, 1.7031, 1.9219, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:55:30,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.08 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:55:30,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 334.93 | bwd_microstep: 1.66 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.77 | step_microstep: 3.45 [2025-11-06 18:55:30,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 481.62 | bwd: 2.37 | bwd_inner: 1.42 | bwd_allreduce: 0.81 | step: 3.54 82%|████████▏ | 2863/3507 [1:10:44<18:11, 1.69s/it] {'loss': 0.3189, 'learning_rate': 1.7178404651228187e-06, 'epoch': 0.82} 82%|████████▏ | 2863/3507 [1:10:44<18:11, 1.69s/it]tensor([[-4.4375, -4.6562, -0.9531, 3.4062, -1.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9219, 0.9180, 4.4062, -2.2188, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6875, -4.0938, 0.6367, 1.9375, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9688, -5.1250, -0.9844, 3.6406, -1.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.7812, -5.4688, 0.2256, 2.5312, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9062, -2.2500, 1.7422, 0.5977, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.7500, -6.7500, -2.2500, 0.2334, 
-4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:55:33,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.03 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-0.1982, 1.8281, 2.2188, 0.3145, -0.5703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:55:34,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.22 | optimizer_step: 0.18 [2025-11-06 18:55:34,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 471.98 | bwd_microstep: 1.93 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.90 | step_microstep: 81.91 [2025-11-06 18:55:34,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 657.04 | bwd: 2.91 | bwd_inner: 1.79 | bwd_allreduce: 0.96 | step: 82.02 82%|████████▏ | 2864/3507 [1:10:47<23:17, 2.17s/it] {'loss': 0.2327, 'learning_rate': 1.7126673540913308e-06, 'epoch': 0.82} 82%|████████▏ | 2864/3507 [1:10:47<23:17, 2.17s/it]tensor([[-4.4375, -4.5312, -0.9609, 3.0312, -1.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -4.0000, -0.1748, 3.7188, -1.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:55:34,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.84 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.1562, -1.5469, 2.7812, -0.5273, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0312, -4.5000, -2.4062, 1.7578, -1.3672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1875, -5.0000, -2.8125, 1.8828, -1.2734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-4.9062, -4.8438, -1.0312, 3.1406, -1.9766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0625, -4.5312, 0.2217, 1.7422, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.9688, -6.5938, -2.0938, 1.7734, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:55:35,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:55:35,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.74 | bwd_microstep: 1.84 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.78 | step_microstep: 2.05 [2025-11-06 18:55:35,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.61 | bwd: 2.61 | bwd_inner: 1.64 | bwd_allreduce: 0.83 | step: 2.15 82%|████████▏ | 2865/3507 [1:10:48<19:26, 1.82s/it] {'loss': 0.0589, 'learning_rate': 1.7075013143917473e-06, 'epoch': 0.82} 82%|████████▏ | 2865/3507 [1:10:48<19:26, 1.82s/it]tensor([[-5.4062, -3.5000, 1.0156, 1.5547, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8438, -1.0391, 1.5547, -0.5195, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -2.6719, 0.8711, 1.1016, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0000, -1.7031, 2.0312, -0.5430, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.9062, -5.9688, -1.9609, 2.5781, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3750, -4.0625, 0.3926, 2.0312, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9062, -3.4375, -0.0938, 1.1406, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:0') [2025-11-06 18:55:36,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.60 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.3125, -2.3750, 1.4844, 1.6875, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:55:36,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:55:36,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.38 | bwd_microstep: 2.01 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.06 [2025-11-06 18:55:36,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.02 | bwd: 3.04 | bwd_inner: 1.97 | bwd_allreduce: 0.93 | step: 2.15 82%|████████▏ | 2866/3507 [1:10:50<17:22, 1.63s/it] {'loss': 0.6573, 'learning_rate': 1.7023423504320934e-06, 'epoch': 0.82} 82%|████████▏ | 2866/3507 [1:10:50<17:22, 1.63s/it]tensor([[-5.9375, -2.9375, 1.2188, -0.6992, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8750, -4.4688, -2.2031, 2.2031, -1.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:55:36,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.33 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-2.1562, 2.2031, 2.8125, -3.2188, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.8750, -6.3438, -2.9375, 1.9453, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -0.6289, 3.1250, -1.4141, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-8.1875, -6.1250, 0.2793, 1.5312, -5.4375]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.4688, -6.8750, -2.2344, 1.0312, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.6875, -6.9062, -2.8438, 1.7422, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:55:37,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:55:37,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.45 | bwd_microstep: 2.45 | bwd_inner_microstep: 1.51 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.42
[2025-11-06 18:55:37,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.76 | bwd: 3.31 | bwd_inner: 2.27 | bwd_allreduce: 0.89 | step: 2.51
82%|████████▏ | 2867/3507 [1:10:51<16:25, 1.54s/it] {'loss': 0.2029, 'learning_rate': 1.6971904666143602e-06, 'epoch': 0.82}
tensor([[-5.3750, -1.3047, 2.6562, -1.3906, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.1250, -3.3438, 1.8203, 0.5859, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4375, -3.8438, 0.4102, 1.3281, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7500, -3.4531, 1.3594, 1.3359, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.5625, -0.3926, 2.3750, -0.5039, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3125, -0.1328, 3.4375, -1.6250, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5625, -2.5469, 2.9219, -0.6055, -5.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:55:39,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.51 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.4688, -5.5312, -1.7500, 2.4062, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:55:39,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.16 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:55:39,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.41 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.87 | step_microstep: 3.26
[2025-11-06 18:55:39,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 419.94 | bwd: 2.98 | bwd_inner: 1.94 | bwd_allreduce: 0.91 | step: 3.35
82%|████████▏ | 2868/3507 [1:10:53<16:52, 1.58s/it] {'loss': 0.1926, 'learning_rate': 1.6920456673344931e-06, 'epoch': 0.82}
tensor([[-5.9375, -2.6562, 2.3750, 0.1670, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4062, -1.1094, 0.9688, -2.3281, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.7812, -2.6094, 1.7344, -0.2559, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:55:39,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.79 | bwd_microstep: 2.10 | bwd_inner_microstep: 1.78 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.22
tensor([[1.7266, 0.8945, 1.5234, 5.2188, 3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.1875, 0.5703, 3.2188, -0.8828, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1562, -2.9375, 1.0469, 0.7969, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.9531, 1.0781, 3.1250, -1.9609, -3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.9219, -3.3438, -0.3828, 1.7578, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:55:41,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.05 | optimizer_gradients: 0.23 | optimizer_step: 0.26
[2025-11-06 18:55:41,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.83 | bwd_microstep: 2.49 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 1.16 | step_microstep: 101.02
[2025-11-06 18:55:41,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.68 | bwd: 4.58 | bwd_inner: 2.97 | bwd_allreduce: 1.28 | step: 101.23
82%|████████▏ | 2869/3507 [1:10:55<19:47, 1.86s/it] {'loss': 1.0207, 'learning_rate': 1.6869079569823932e-06, 'epoch': 0.82}
tensor([[-3.5000, -1.3594, 1.2891, 0.5000, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0625, -4.0625, -0.0869, 2.0312, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5312, -3.9062, -0.5195, 4.1250, -0.7305]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:55:42,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.32 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.3281, -2.3750, 1.3047, 3.3750, -1.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.4531, -0.0879, 1.4844, 0.0688, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.5000, 1.4219, 4.1250, -0.5781, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.9531, 1.1484, 3.4688, -1.7969, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.9688, -1.7266, 2.3438, -0.3145, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 18:55:42,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:55:42,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.76 | bwd_microstep: 89.66 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 88.68 | step_microstep: 2.25
[2025-11-06 18:55:42,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.11 | bwd: 90.40 | bwd_inner: 1.57 | bwd_allreduce: 88.71 | step: 2.32
82%|████████▏ | 2870/3507 [1:10:56<15:22, 1.45s/it] {'loss': 1.0384, 'learning_rate': 1.6817773399419201e-06, 'epoch': 0.82}
tensor([[-4.9375, -0.8555, 3.6406, -0.3398, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.8125, -5.3125, -0.9766, 2.5781, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.2188, -3.0156, 2.5625, 0.5352, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8438, -0.8594, 3.3750, -0.6914, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:55:42,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.72 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.2188, -2.7656, 0.8750, 0.0835, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.9062, -6.3125, -1.7422, 1.3359, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.1562, -5.8438, -1.4375, 2.5156, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.1875, 0.8203, 3.1719, -1.9688, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:55:45,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:55:45,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.21 | bwd_microstep: 2.33 | bwd_inner_microstep: 1.38 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.14
[2025-11-06 18:55:45,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 626.96 | bwd: 3.20 | bwd_inner: 2.15 | bwd_allreduce: 0.90 | step: 2.22
82%|████████▏ | 2871/3507 [1:10:58<19:27, 1.84s/it] {'loss': 0.5954, 'learning_rate': 1.6766538205908734e-06, 'epoch': 0.82}
tensor([[-4.9062, -4.3125, -0.6562, 2.0625, -2.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:55:45,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.66 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.0312, -4.6875, -0.4844, 3.2656, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.7188, -4.3438, 0.2197, 1.8281, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.5000, -5.0312, 1.3984, 1.7344, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.9062, -2.7812, 2.5312, 0.9180, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7812, -3.5312, -0.0240, 3.4531, -1.3672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7812, -5.0000, -0.2373, 3.0000, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-8.1250, -6.6250, -0.3691, 2.0000, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:55:45,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:55:45,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.37 | bwd_microstep: 26.58 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 25.28 | step_microstep: 1.85
[2025-11-06 18:55:45,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 413.02 | bwd: 27.50 | bwd_inner: 2.03 | bwd_allreduce: 25.33 | step: 1.93
82%|████████▏ | 2872/3507 [1:10:59<15:07, 1.43s/it] {'loss': 0.6178, 'learning_rate': 1.6715374033009945e-06, 'epoch': 0.82}
tensor([[-4.1875, -4.5625, -2.0312, 1.8906, -1.5547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:55:45,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.52 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-6.8750, -4.0625, -0.1260, -1.5391, -5.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0000, -3.8125, 1.0000, 1.2500, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.4688, -4.5625, -0.1797, 2.5312, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.2812, -4.8125, -1.2734, -0.1299, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3438, -0.5469, 2.3281, -1.7500, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3750, -5.3125, -1.0156, 3.2812, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4219, -2.4688, 1.6797, 4.0000, -1.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:55:47,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:55:47,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.86 | bwd_microstep: 1.94 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.59
[2025-11-06 18:55:47,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.40 | bwd: 2.91 | bwd_inner: 1.92 | bwd_allreduce: 0.85 | step: 2.68
82%|████████▏ | 2873/3507 [1:11:01<18:02, 1.71s/it] {'loss': 0.644, 'learning_rate': 1.6664280924379682e-06, 'epoch': 0.82}
tensor([[-4.5625, -2.7812, 1.7266, 2.6250, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:55:48,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.78 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.9062, -4.0312, -0.0864, 0.0371, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5938, -5.5312, -0.6602, 1.5938, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.8359, 1.7422, 2.9688, -0.9102, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.3750, -5.6562, 0.6172, 2.5156, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.8438, -3.4375, 0.9883, 2.5625, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6875, -2.2812, 1.4766, 0.8906, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.1406, 1.9844, 3.4688, -1.7734, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
[2025-11-06 18:55:48,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:55:48,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.63 | bwd_microstep: 199.76 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 198.88 | step_microstep: 1.55
[2025-11-06 18:55:48,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 279.43 | bwd: 200.72 | bwd_inner: 1.67 | bwd_allreduce: 198.93 | step: 1.64
82%|████████▏ | 2874/3507 [1:11:02<14:13, 1.35s/it] {'loss': 0.686, 'learning_rate': 1.6613258923614217e-06, 'epoch': 0.82}
tensor([[-6.9688, -4.9062, 0.8633, 1.5859, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0000, -5.5312, -3.1094, 1.3125, -1.9453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4219, -4.3750, -2.4844, 2.1719, -0.6836]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[2.9375, 5.1875, 6.4688, 4.6875, 2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:55:48,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.76 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.7188, -2.4688, 1.5234, 1.0703, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-8.8125, -5.6875, -0.8438, -2.6250, -7.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6875, -2.3750, 0.8242, -0.4082, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.6562, -5.5312, 0.8008, 1.7734, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:55:51,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.66 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:55:51,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 283.73 | bwd_microstep: 1.85 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.41
[2025-11-06 18:55:51,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 485.51 | bwd: 2.57 | bwd_inner: 1.58 | bwd_allreduce: 0.85 | step: 2.49
82%|████████▏ | 2875/3507 [1:11:05<19:43, 1.87s/it] {'loss': 1.277, 'learning_rate': 1.6562308074249045e-06, 'epoch': 0.82}
tensor([[-5.7500, -1.2734, 4.1562, -0.5156, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5625, -1.2578, 2.7031, 0.3262, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.6562, -4.1562, 0.2100, 1.6875, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:55:51,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.05 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.4062, -2.5156, 2.7656, -0.8203, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.9688, -4.5625, 0.1348, 1.8594, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.9688, -4.3438, -1.6250, 2.5781, -1.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.7188, -5.0312, 0.6797, 0.3594, -5.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.0000, -5.5938, -0.8906, 3.0938, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:55:51,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.34 | optimizer_step: 0.23
[2025-11-06 18:55:51,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.91 | bwd_microstep: 78.74 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 77.57 | step_microstep: 2.31
[2025-11-06 18:55:51,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.98 | bwd: 79.70 | bwd_inner: 1.96 | bwd_allreduce: 77.61 | step: 2.41
82%|████████▏ | 2876/3507 [1:11:05<15:16, 1.45s/it] {'loss': 0.1427, 'learning_rate': 1.6511428419759012e-06, 'epoch': 0.82}
tensor([[-5.6875, -4.0625, 0.7539, 2.2188, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.1094, 1.3281, 2.7188, -0.6172, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.3125, -3.5938, 0.4688, 1.2422, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:55:52,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.80 | bwd_microstep: 1.17 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.15
tensor([[-5.9688, -3.3281, 1.5625, 0.7344, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8438, -3.0625, 1.6641, 2.6406, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6719, 1.5859, 3.0469, -2.4062, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.9531, -1.7500, 2.1094, 1.7891, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.6250, -5.0312, 1.1953, 1.2109, -5.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:55:54,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 18:55:54,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.34 | bwd_microstep: 1.86 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.07
[2025-11-06 18:55:54,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.18 | bwd: 3.03 | bwd_inner: 1.87 | bwd_allreduce: 0.93 | step: 2.23
82%|████████▏ | 2877/3507 [1:11:08<18:02, 1.72s/it] {'loss': 0.6644, 'learning_rate': 1.6460620003558193e-06, 'epoch': 0.82}
tensor([[-3.6875, -3.4375, -0.7773, 1.9219, -1.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.3125, -0.0354, 2.6562, -0.5703, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0625, -3.7812, 0.4609, 2.1094, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.1406, 0.6953, 2.6719, -2.2812, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:55:54,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.88 | bwd_microstep: 0.61 | bwd_inner_microstep: 0.51 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.0000, -0.9336, 3.1562, -1.0000, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.4688, -4.2812, -1.0469, 2.5625, -1.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9062, -2.4375, 1.3828, 0.5820, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.3125, -2.3906, 2.9531, -0.8477, -5.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:55:54,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:55:54,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.10 | bwd_microstep: 111.92 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 110.74 | step_microstep: 1.79
[2025-11-06 18:55:54,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.00 | bwd: 112.52 | bwd_inner: 1.61 | bwd_allreduce: 110.78 | step: 1.87
82%|████████▏ | 2878/3507 [1:11:08<14:04, 1.34s/it] {'loss': 0.9175, 'learning_rate': 1.6409882868999883e-06, 'epoch': 0.82}
tensor([[-5.3438, -2.2969, 2.8281, 1.2266, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6875, -4.2188, -0.3965, 2.6094, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.8438, -3.6250, 1.4609, 1.7500, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:55:54,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.01 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-5.7188, -5.0000, -0.7656, 2.0000, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.5938, -4.2812, 1.6875, 2.2344, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.7188, -2.1094, 2.3125, 1.6250, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.1250, -3.8594, 1.0703, 0.9688, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5312, -1.5703, 2.7656, 0.7383, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:55:57,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:55:57,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.09 | bwd_microstep: 1.67 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.77 | step_microstep: 2.84
[2025-11-06 18:55:57,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.12 | bwd: 2.55 | bwd_inner: 1.61 | bwd_allreduce: 0.80 | step: 2.93
82%|████████▏ | 2879/3507 [1:11:11<17:22, 1.66s/it] {'loss': 0.7213, 'learning_rate': 1.6359217059376552e-06, 'epoch': 0.82}
tensor([[-4.2188, -1.2578, 1.8359, -0.4023, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:55:57,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.89 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-10.0625, -8.8125, -3.5312, -1.1484, -6.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.5625, -4.8125, 0.0178, 1.2891, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.3828, 2.9219, 2.2656, -2.1250, -1.5859]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-7.5625, -5.6250, 0.4668, 1.7656, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6875, 1.2422, 3.0938, -2.0469, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8750, -0.7891, 3.1406, -1.3906, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.3750, -1.1172, 2.4844, -0.3164, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:55:57,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.22 | optimizer_step: 0.19
[2025-11-06 18:55:57,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.16 | bwd_microstep: 20.25 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 19.19 | step_microstep: 1.92
[2025-11-06 18:55:57,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 422.09 | bwd: 21.18 | bwd_inner: 1.75 | bwd_allreduce: 19.24 | step: 2.02
82%|████████▏ | 2880/3507 [1:11:11<13:40, 1.31s/it] {'loss': 0.1825, 'learning_rate': 1.6308622617919823e-06, 'epoch': 0.82}
tensor([[-4.3438, -4.7500, -1.7891, 2.7188, -1.3984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:55:57,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.24 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.0625, -1.8828, 2.3750, 2.2812, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0312, -3.3594, 0.6406, 1.2969, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9688, -4.9375, -1.1797, 2.6562, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[2.3594, 5.2500, 6.7500, 3.7031, 1.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3125, -1.5234, 2.2812, -1.7109, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.3281, -0.0549, 0.6094, -2.8125, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.5938, -4.9375, -0.3867, 0.7773, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:55:59,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.16 | optimizer_step: 0.23
[2025-11-06 18:55:59,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.98 | bwd_microstep: 2.21 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.99 | step_microstep: 2.30
[2025-11-06 18:55:59,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.24 | bwd: 3.03 | bwd_inner: 1.82 | bwd_allreduce: 1.03 | step: 2.40
82%|████████▏ | 2881/3507 [1:11:13<16:49, 1.61s/it] {'loss': 0.3662, 'learning_rate': 1.625809958780037e-06, 'epoch': 0.82}
tensor([[-2.7344, -3.5156, -1.8047, 2.3125, -0.3027]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4688, -1.4375, 3.1250, -0.8672, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6250, -0.4316, 3.4062, -1.6797, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.3750, -5.6562, 0.4023, 2.0781, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:56:00,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.74 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.4531, -1.3984, 1.9297, 1.2734, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-8.0625, -5.2188, 0.7227, -0.1406, -6.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.8438, -5.1250, 0.6367, 2.3906, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.7188, -5.6562, -0.1816, 2.7031, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:56:00,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.13 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:56:00,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.64 | bwd_microstep: 348.93 | bwd_inner_microstep: 5.52 | bwd_allreduce_microstep: 343.33 | step_microstep: 3.24
[2025-11-06 18:56:00,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 576.41 | bwd: 349.63 | bwd_inner: 6.11 | bwd_allreduce: 343.37 | step: 3.32
82%|████████▏ | 2882/3507 [1:11:14<14:48, 1.42s/it] {'loss': 0.1685, 'learning_rate': 1.6207648012128063e-06, 'epoch': 0.82}
tensor([[-6.2188, -1.7734, 3.2812, -1.6094, -6.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.0000, -4.7188, -0.2695, 1.6406, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:56:01,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.38 | bwd_microstep: 1.55 | bwd_inner_microstep: 1.38 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12
tensor([[-4.1875, -2.1719, 2.1719, 2.5469, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-8.3125, -7.0000, -2.2656, -0.4141, -5.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.2188, -5.0000, 1.1953, 1.8828, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9375, -1.8281, 2.7188, 0.6172, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-5.5938, -4.8438, -0.1162, 2.8750, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6875, -4.3438, -1.8281, 2.6250, -0.9609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:56:03,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.22 | optimizer_step: 0.23
[2025-11-06 18:56:03,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.94 | bwd_microstep: 2.28 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 0.99 | step_microstep: 12.22
[2025-11-06 18:56:03,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.33 | bwd: 3.83 | bwd_inner: 2.58 | bwd_allreduce: 1.05 | step: 12.35
82%|████████▏ | 2883/3507 [1:11:16<17:07, 1.65s/it] {'loss': 0.7539, 'learning_rate': 1.6157267933951637e-06, 'epoch': 0.82}
tensor([[-6.3750, -6.1875, -1.7812, 2.4062, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0000, -4.4375, -0.4531, 2.5625, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:56:03,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.57 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.5938, -3.7188, -0.7227, 3.0156, -1.1328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.7031, -0.8828, 0.7969, 1.6719, -0.7695]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.7812, 1.0078, 2.2969, -2.2969, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-6.3125, -6.6562, -3.2969, 1.3906, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.0625, -2.4062, 1.2266, -2.2031, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.2188, -4.4375, -0.7812, -2.2344, -5.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:56:03,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.14 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:56:03,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.06 | bwd_microstep: 366.45 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 365.61 | step_microstep: 3.34
[2025-11-06 18:56:03,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.66 | bwd: 367.26 | bwd_inner: 1.46 | bwd_allreduce: 365.65 | step: 3.42
82%|████████▏ | 2884/3507 [1:11:17<14:18, 1.38s/it] {'loss': 0.3077, 'learning_rate': 1.6106959396258926e-06, 'epoch': 0.82}
tensor([[-4.4688, -5.2500, -2.8750, 1.8203, -1.4766]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.1406, -1.3203, 0.8867, 0.6992, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:56:04,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.43 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.7188, -1.1250, 2.6562, -0.9102, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.3438, -1.4531, 2.7656, -1.4766, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.7031, -3.5156, -1.8516, 2.2344, -0.2852]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7500, 0.8008, 3.3750, -2.7969, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.3438, -1.6484, 2.9219, -0.3750, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5000, -3.7344, 0.4277, 1.1953, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:56:05,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 12.18 | optimizer_gradients: 0.22 | optimizer_step: 0.21
[2025-11-06 18:56:05,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.27 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.85 | step_microstep: 14.49
[2025-11-06 18:56:05,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.72 | bwd: 2.86 | bwd_inner: 1.82 | bwd_allreduce: 0.88 | step: 14.58
82%|████████▏ | 2885/3507 [1:11:19<14:35, 1.41s/it] {'loss': 0.6523, 'learning_rate': 1.6056722441976668e-06, 'epoch': 0.82}
tensor([[-5.6562, -5.5000, -1.6094, 2.1719, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:56:05,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.06 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-6.8438, -4.7500, -0.3301, 0.0996, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.7500, -5.5000, -0.4941, 1.6719, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.7812, -5.8750, -2.0000, 2.4531, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.4688, -7.7812, -3.9062, 0.8477, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3438, -1.6641, 2.7344, -0.7969, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.7812, -3.7500, -3.0156, 0.8398, -0.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0938, -4.2500, -0.1025, 2.5000, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:56:08,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.21 | optimizer_step: 0.19
[2025-11-06 18:56:08,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.53 | bwd_microstep: 2363.38 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 2362.52 | step_microstep: 2.54
[2025-11-06 18:56:08,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 285.61 | bwd: 2364.37 | bwd_inner: 1.64 | bwd_allreduce: 2362.58 | step: 2.64
82%|████████▏ | 2886/3507 [1:11:21<18:34, 1.79s/it] {'loss': 0.1373, 'learning_rate': 1.600655711397059e-06, 'epoch': 0.82}
tensor([[-5.6562, -1.8828, 3.2500, 0.0415, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:56:08,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.06 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-1.6719, -2.5781, -2.7344, 0.5938, 0.2832]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-5.0625, -3.5625, 0.7070, 2.0312, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5625, -4.1250, 0.2676, 3.9531, -1.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0938, -5.4375, -1.8516, 2.7812, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.6562, -4.0938, 1.5312, 1.3984, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.2031, -3.6406, -1.8125, 1.8047, -0.8555]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3125, -4.7500, -2.3281, 2.0000, -1.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
[2025-11-06 18:56:08,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.19
[2025-11-06 18:56:08,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.16 | bwd_microstep: 142.03 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 140.96 | step_microstep: 1.97
[2025-11-06 18:56:08,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.23 | bwd: 142.98 | bwd_inner: 1.78 | bwd_allreduce: 141.03 | step: 2.09
82%|████████▏ | 2887/3507 [1:11:22<14:34, 1.41s/it] {'loss': 1.3338, 'learning_rate': 1.5956463455045268e-06, 'epoch': 0.82}
tensor([[-6.3125, -6.2812, -2.5625, 1.5625, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.2812, -3.9062, 1.3203, 1.0625, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6250, -4.3750, 0.0737, 1.9062, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.5000, -2.6406, 2.6094, -0.9336, -5.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:56:09,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.67 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-5.0625, -4.0000, 0.0972, 2.0156, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.8281, 1.1172, 4.0625, -2.6875, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.1875, -6.5000, -2.8438, 0.0771, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-8.0625, -4.6250, 1.2734, -0.7266, -6.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:56:11,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.17 | optimizer_step: 0.22
[2025-11-06 18:56:11,733] [INFO] [logging.py:128:log_dist]
[Rank 0] time (ms) | fwd_microstep: 235.85 | bwd_microstep: 1700.74 | bwd_inner_microstep: 3.19 | bwd_allreduce_microstep: 1697.43 | step_microstep: 1.91 [2025-11-06 18:56:11,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.54 | bwd: 1701.57 | bwd_inner: 3.92 | bwd_allreduce: 1697.48 | step: 2.01 82%|████████▏ | 2888/3507 [1:11:25<19:57, 1.93s/it] {'loss': 0.181, 'learning_rate': 1.5906441507944059e-06, 'epoch': 0.82} 82%|████████▏ | 2888/3507 [1:11:25<19:57, 1.93s/it]tensor([[-5.8438, -5.9688, -2.5625, 1.5469, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -4.1250, 0.2871, 2.3438, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.3125, -4.2500, 1.3516, 0.0596, -5.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:11,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.22 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.4375, -4.7500, -1.3672, 3.1875, -1.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7500, -1.3828, 1.1172, 1.9922, -1.4297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5312, -3.1719, 0.0674, 3.3125, -1.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5312, -2.8750, 1.0625, 1.6172, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0625, -4.9688, -1.3750, 2.3750, -2.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:56:12,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:56:12,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 207.70 | bwd_microstep: 7.56 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 6.43 | step_microstep: 3.10 [2025-11-06 18:56:12,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 404.95 | bwd: 8.57 | bwd_inner: 1.96 | bwd_allreduce: 6.47 | step: 3.19 82%|████████▏ | 2889/3507 [1:11:26<15:21, 1.49s/it] {'loss': 0.2415, 'learning_rate': 1.5856491315349199e-06, 'epoch': 0.82} 82%|████████▏ | 2889/3507 [1:11:26<15:21, 1.49s/it]tensor([[-2.7188, 0.3223, 3.3438, 0.7148, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1875, -0.5898, 2.1875, -1.4219, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8750, -3.4219, 0.8164, 0.4102, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -0.9961, 2.0156, -0.6016, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4062, -1.0781, 2.6719, -0.3398, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:12,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.22 | bwd_microstep: 1.14 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.1250, -3.8750, 0.6406, 2.7656, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9062, -2.0469, 2.7656, -0.9102, -5.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.2500, -5.8438, -1.0469, 2.9531, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:56:14,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.16 | optimizer_step: 0.23 [2025-11-06 18:56:14,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.77 | 
bwd_microstep: 1761.49 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1760.30 | step_microstep: 2.60 [2025-11-06 18:56:14,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.00 | bwd: 1762.60 | bwd_inner: 2.05 | bwd_allreduce: 1760.36 | step: 2.70 82%|████████▏ | 2890/3507 [1:11:28<18:56, 1.84s/it] {'loss': 0.1877, 'learning_rate': 1.5806612919881726e-06, 'epoch': 0.82} 82%|████████▏ | 2890/3507 [1:11:28<18:56, 1.84s/it]tensor([[-3.5156, -0.5430, 2.5312, 0.0254, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.1875, -5.0000, -0.8633, 0.9648, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:15,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.97 | bwd_microstep: 0.63 | bwd_inner_microstep: 0.53 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.6250, -6.0938, -2.5781, 2.4219, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9062, -4.4375, -1.5312, 2.9219, -1.1328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0625, -1.6719, 1.1172, -1.9922, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5625, -4.6250, 0.0236, 2.6094, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2812, -2.5469, 1.8594, 2.7031, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9688, 2.4531, 4.4062, -1.7500, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:15,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.18 [2025-11-06 18:56:15,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.34 | bwd_microstep: 1.60 | 
bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.77 | step_microstep: 1.62 [2025-11-06 18:56:15,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 474.34 | bwd: 2.23 | bwd_inner: 1.30 | bwd_allreduce: 0.80 | step: 1.70 82%|████████▏ | 2891/3507 [1:11:29<14:49, 1.44s/it] {'loss': 0.1269, 'learning_rate': 1.575680636410134e-06, 'epoch': 0.82} 82%|████████▏ | 2891/3507 [1:11:29<14:49, 1.44s/it]tensor([[-2.5625, -3.7812, -2.5781, 2.2969, 0.0928]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -2.3750, 1.7812, 0.7969, -3.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8125, -4.4688, -0.5312, 3.0781, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5938, -3.5312, 0.6406, 2.9062, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:15,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.37 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.3594, 0.6406, 2.6094, -2.0469, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5625, -3.1875, 0.5664, 2.0938, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6250, -4.5625, -0.9766, 2.9531, -1.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.2500, -4.7188, 0.8125, 0.4883, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:56:18,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:56:18,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.95 | bwd_microstep: 2334.29 | bwd_inner_microstep: 0.77 | 
bwd_allreduce_microstep: 2333.44 | step_microstep: 1.79 [2025-11-06 18:56:18,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.34 | bwd: 2335.04 | bwd_inner: 1.42 | bwd_allreduce: 2333.48 | step: 1.88 82%|████████▏ | 2892/3507 [1:11:31<18:40, 1.82s/it] {'loss': 0.1897, 'learning_rate': 1.5707071690506504e-06, 'epoch': 0.82} 82%|████████▏ | 2892/3507 [1:11:31<18:40, 1.82s/it]tensor([[-4.6562, -2.2969, 1.6719, 0.7891, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -3.5000, 1.3672, 2.9688, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6250, -1.9844, 1.7188, 0.4395, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:18,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.89 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.8125, -3.9844, 0.7227, 1.5469, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2969, -1.6562, 0.1973, 0.3613, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7188, -4.6250, -0.7461, 3.1250, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-8.3125e+00, -6.9375e+00, -1.8906e+00, -5.8594e-03, -5.4688e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.7969, 1.3359, 2.3438, 1.1016, -0.6797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:56:18,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:56:18,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.04 | bwd_microstep: 56.33 | bwd_inner_microstep: 0.81 | 
bwd_allreduce_microstep: 55.46 | step_microstep: 1.63 [2025-11-06 18:56:18,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 329.93 | bwd: 57.19 | bwd_inner: 1.59 | bwd_allreduce: 55.49 | step: 1.70 82%|████████▏ | 2893/3507 [1:11:32<14:19, 1.40s/it] {'loss': 0.4561, 'learning_rate': 1.5657408941534303e-06, 'epoch': 0.82} 82%|████████▏ | 2893/3507 [1:11:32<14:19, 1.40s/it]tensor([[-2.9688, -3.2344, -1.0391, 2.3906, -0.7383]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1875, -3.4531, 0.3770, 1.0547, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:18,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.34 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.3125, -0.5977, 3.1406, -0.6172, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5312, -5.2188, -1.2891, 2.2031, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.1562, -1.4062, 0.7656, 2.3594, -0.7578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-8.7500, -5.3438, 0.5273, -1.4531, -7.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9375, -3.4688, 0.5781, 2.0312, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0938, -1.2812, 2.3125, 0.5000, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:56:20,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.24 | optimizer_step: 0.22 [2025-11-06 18:56:20,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.64 | bwd_microstep: 1488.16 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 
1486.95 | step_microstep: 2.43 [2025-11-06 18:56:20,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.00 | bwd: 1488.96 | bwd_inner: 1.80 | bwd_allreduce: 1487.00 | step: 2.51 83%|████████▎ | 2894/3507 [1:11:34<15:40, 1.53s/it] {'loss': 0.1715, 'learning_rate': 1.5607818159560473e-06, 'epoch': 0.83} 83%|████████▎ | 2894/3507 [1:11:34<15:40, 1.53s/it]tensor([[-2.0469, -2.3125, -0.7734, 2.7500, 0.0439]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-2.5938, 0.3652, 2.3594, -0.5977, -2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5469, -2.6250, 1.3594, 3.8281, -1.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4688, -2.2188, 1.4219, -1.2891, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:20,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.52 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-2.9219, -3.7969, -2.0938, 2.2812, -0.3652]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8438, 0.5430, 2.5156, -0.6172, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6406, -0.5938, 1.6562, 1.1562, -1.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.6250, -4.0312, 0.1416, 1.1328, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:56:20,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:56:20,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.67 | bwd_microstep: 9.36 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 8.28 | step_microstep: 2.09 
[2025-11-06 18:56:20,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.21 | bwd: 10.25 | bwd_inner: 1.73 | bwd_allreduce: 8.34 | step: 2.20 83%|████████▎ | 2895/3507 [1:11:34<12:03, 1.18s/it] {'loss': 0.933, 'learning_rate': 1.5558299386899333e-06, 'epoch': 0.83} 83%|████████▎ | 2895/3507 [1:11:34<12:03, 1.18s/it]tensor([[-1.6328, 2.3281, 3.0781, -1.8516, -2.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0000, -2.9062, 1.3203, 1.6641, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0938, -4.1562, -0.6250, 1.3438, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -3.1875, 0.4629, 2.2812, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:20,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.55 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.4531, -3.1719, -0.3066, 2.4688, -1.3984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.7812, -4.6875, -0.2578, 2.2188, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0312, -2.6562, 1.1562, 0.4102, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7031, -3.3906, -0.8789, 1.9062, -1.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:56:24,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.25 | optimizer_step: 0.36 [2025-11-06 18:56:24,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.44 | bwd_microstep: 3508.33 | bwd_inner_microstep: 4.52 | bwd_allreduce_microstep: 3503.71 | step_microstep: 4.10 [2025-11-06 18:56:24,605]
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.00 | bwd: 3509.28 | bwd_inner: 5.35 | bwd_allreduce: 3503.78 | step: 4.20 83%|████████▎ | 2896/3507 [1:11:38<20:23, 2.00s/it] {'loss': 0.5304, 'learning_rate': 1.5508852665803776e-06, 'epoch': 0.83} 83%|████████▎ | 2896/3507 [1:11:38<20:23, 2.00s/it]tensor([[-2.8125, 1.0156, 3.3906, -1.4609, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8125, -2.6094, 1.0859, 5.0312, -0.3867]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1562, -3.2812, 1.2656, 1.8984, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6250, -2.9688, 2.0625, 1.2812, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:24,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.79 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.8125, -5.1875, -0.6953, 2.7031, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.6875, -1.5000, 2.0938, 0.0898, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8438, -2.2656, 2.2500, -0.7578, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3125, -1.3516, 3.2656, -0.3633, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:25,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:56:25,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 274.53 | bwd_microstep: 1.57 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.72 | step_microstep: 1.78 [2025-11-06 18:56:25,143] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 476.36 | bwd: 2.31 | bwd_inner: 1.41 | bwd_allreduce: 0.76 | step: 1.86 83%|████████▎ | 2897/3507 [1:11:38<15:52, 1.56s/it] {'loss': 0.1532, 'learning_rate': 1.5459478038465158e-06, 'epoch': 0.83} 83%|████████▎ | 2897/3507 [1:11:38<15:52, 1.56s/it]tensor([[-1.9688, 0.5039, 0.9531, -1.2031, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-1.7266, -2.4219, -1.6016, 1.9688, 0.2988]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -0.9297, 3.3125, -0.5859, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.8750, -4.2188, 1.8281, 1.5000, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:25,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.24 | bwd_microstep: 2.90 | bwd_inner_microstep: 2.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.7188, -4.7500, 0.2539, 2.9219, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6250, -1.3516, 2.8125, 0.4668, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.9375, -1.9219, 2.8281, -1.2422, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2500, -1.4297, 1.9297, 0.0056, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:56:28,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.28 | optimizer_step: 0.24 [2025-11-06 18:56:28,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.94 | bwd_microstep: 3394.69 | bwd_inner_microstep: 6.43 | bwd_allreduce_microstep: 3388.15 | step_microstep: 2.63 [2025-11-06 18:56:28,964] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 381.22 | bwd: 3397.60 | bwd_inner: 9.22 | bwd_allreduce: 3388.21 | step: 2.72 83%|████████▎ | 2898/3507 [1:11:42<22:44, 2.24s/it] {'loss': 0.2809, 'learning_rate': 1.5410175547013461e-06, 'epoch': 0.83} 83%|████████▎ | 2898/3507 [1:11:42<22:44, 2.24s/it]tensor([[-2.5781, 1.2344, 1.2969, -3.4375, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.2812, -3.2344, 2.1406, 0.7344, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:29,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.73 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.1875, -4.1250, 0.0579, 2.2188, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4375, -3.6719, 0.4531, 1.1953, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8750, -5.7812, -1.6953, 2.4062, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -4.9062, -1.3516, 1.7891, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6406, -2.9531, 1.0703, 4.1562, -1.3672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.4375, -4.0312, 1.6719, 1.7578, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:56:29,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:56:29,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.73 | bwd_microstep: 112.93 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 111.79 | step_microstep: 1.86 [2025-11-06 18:56:29,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.44 
| bwd: 113.82 | bwd_inner: 1.86 | bwd_allreduce: 111.83 | step: 1.93 83%|████████▎ | 2899/3507 [1:11:43<17:23, 1.72s/it] {'loss': 0.2863, 'learning_rate': 1.5360945233516933e-06, 'epoch': 0.83} 83%|████████▎ | 2899/3507 [1:11:43<17:23, 1.72s/it]tensor([[-3.1719, -0.5156, 2.1719, 0.2617, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.1406, -2.6250, -1.7500, 1.2969, -0.2080]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7500, -4.6562, -1.0156, 0.8516, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:29,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.68 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.9375, -3.8281, -0.4297, 3.1719, -1.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5312, -3.3125, -0.5586, 2.6406, -1.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.4062, -4.8125, 0.1572, 3.8438, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -0.2422, 3.2500, -2.5625, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5312, -3.9531, 0.8164, 2.5625, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:56:31,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.18 | optimizer_step: 0.26 [2025-11-06 18:56:31,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.53 | bwd_microstep: 1910.89 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1909.71 | step_microstep: 2.05 [2025-11-06 18:56:31,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 325.23 | bwd: 1911.82 | 
bwd_inner: 1.93 | bwd_allreduce: 1909.76 | step: 2.13 83%|████████▎ | 2900/3507 [1:11:45<19:02, 1.88s/it] {'loss': 0.1213, 'learning_rate': 1.531178713998235e-06, 'epoch': 0.83} 83%|████████▎ | 2900/3507 [1:11:45<19:02, 1.88s/it]tensor([[-5.1250, -4.2812, 0.1216, 2.5000, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -4.8438, -1.1484, 2.5625, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2812, -4.5312, -0.4414, 2.3594, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9219, -4.8438, -3.5938, 0.7070, -1.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:32,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.11 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.4375, -4.5312, -0.8633, 3.4219, -1.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, -3.9375, -0.2773, 3.2344, -1.5391]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2188, -3.7656, -0.1553, 2.8438, -1.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.4688, -3.1719, 2.1094, 2.1094, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:56:32,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:56:32,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.45 | bwd_microstep: 30.92 | bwd_inner_microstep: 1.37 | bwd_allreduce_microstep: 29.46 | step_microstep: 1.61 [2025-11-06 18:56:32,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 454.58 | bwd: 31.91 | bwd_inner: 2.28 | 
bwd_allreduce: 29.50 | step: 1.69 83%|████████▎ | 2901/3507 [1:11:46<14:54, 1.48s/it] {'loss': 0.1484, 'learning_rate': 1.526270130835481e-06, 'epoch': 0.83} 83%|████████▎ | 2901/3507 [1:11:46<14:54, 1.48s/it]tensor([[-5.3438, -1.3984, 3.3281, -0.4688, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4688, -2.6094, 1.7188, 0.3398, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0938, -2.8281, 1.8906, -0.3047, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2031, 0.7031, 2.6875, -1.5703, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:56:32,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.15 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5938, -2.7969, 0.4707, 2.5469, -1.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8438, -2.1719, 3.0781, 0.1006, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2969, 1.0391, 3.4062, -2.1250, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0000, -4.3125, -0.8281, 1.6641, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:56:34,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.21 | optimizer_step: 0.25 [2025-11-06 18:56:34,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.72 | bwd_microstep: 1885.77 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 1884.81 | step_microstep: 2.60 [2025-11-06 18:56:34,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.90 | bwd: 1886.44 | bwd_inner: 1.43 | bwd_allreduce: 1884.86 | 
step: 2.69 83%|████████▎ | 2902/3507 [1:11:48<17:15, 1.71s/it] {'loss': 0.1136, 'learning_rate': 1.5213687780517827e-06, 'epoch': 0.83} 83%|████████▎ | 2902/3507 [1:11:48<17:15, 1.71s/it]tensor([[-1.9062, 1.5234, 2.4844, -1.3672, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.9844, -1.1797, 2.7031, 3.4219, -1.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0312, -3.0156, 1.1172, -0.8203, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7031, -2.5156, -0.1216, 0.9062, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:34,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.66 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-7.3750, -4.4375, 1.5156, 0.4668, -5.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6250, 0.6406, 3.7344, -1.6875, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.8125, -2.4531, 1.4531, 0.3926, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5312, -4.1562, 0.5625, 2.4844, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:56:34,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 18:56:34,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.93 | bwd_microstep: 2.22 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1.11 | step_microstep: 2.20 [2025-11-06 18:56:34,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.62 | bwd: 2.93 | bwd_inner: 1.62 | bwd_allreduce: 1.14 | step: 2.29 83%|████████▎ | 2903/3507
[1:11:48<13:20, 1.33s/it] {'loss': 0.9091, 'learning_rate': 1.5164746598293157e-06, 'epoch': 0.83} 83%|████████▎ | 2903/3507 [1:11:48<13:20, 1.33s/it]tensor([[-7.5938, -5.8438, -0.0244, 1.6016, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[ 0.4824, 3.1250, 2.3750, -0.4824, -0.3672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:56:35,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.88 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-6.9688, -4.8438, -0.1670, 0.3926, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5312, 0.2168, 2.9219, -0.9336, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.5625, -2.6719, -2.1406, 2.0156, 0.6680]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.5312, -5.9375, 0.3555, 2.5156, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.1875, -3.4219, 2.4219, 1.5000, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6875, -0.3516, 2.5312, -2.7812, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 18:56:38,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:56:38,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.35 | bwd_microstep: 2953.26 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 2952.30 | step_microstep: 1.71 [2025-11-06 18:56:38,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.25 | bwd: 2954.13 | bwd_inner: 1.62 | bwd_allreduce: 2952.36 | step: 1.81 83%|████████▎ | 2904/3507 [1:11:52<19:30, 
1.94s/it] {'loss': 0.7219, 'learning_rate': 1.5115877803440836e-06, 'epoch': 0.83} 83%|████████▎ | 2904/3507 [1:11:52<19:30, 1.94s/it]tensor([[-5.0000, -4.6875, -1.9922, 0.8945, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4531, -0.4414, 1.9844, -0.3770, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:56:38,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.70 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.7500, -4.7500, 1.3750, 2.5938, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6250, -0.5391, 1.1406, -4.1875, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-6.0938, -4.8125, -0.1006, 2.0156, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2969, -1.3516, 2.3438, 2.6250, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0000, -0.2451, 3.7969, -2.1562, -5.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.0312, -5.3438, 0.3926, 2.1562, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:56:38,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:56:38,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.66 | bwd_microstep: 198.66 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 197.65 | step_microstep: 1.49 [2025-11-06 18:56:38,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.39 | bwd: 199.74 | bwd_inner: 1.91 | bwd_allreduce: 197.69 | step: 1.57 83%|████████▎ | 2905/3507 [1:11:52<15:24, 1.54s/it] {'loss': 0.864, 
'learning_rate': 1.5067081437659093e-06, 'epoch': 0.83} 83%|████████▎ | 2905/3507 [1:11:52<15:24, 1.54s/it]tensor([[-4.7500, -3.7656, 0.3223, 2.4531, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5625, 0.4512, 2.6562, 0.3672, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8125, -4.5938, -0.4785, 1.2969, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:39,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.51 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.9688, -4.9688, -0.3340, 0.1914, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -1.3438, 2.0156, -1.2812, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8750, -5.3438, -1.3594, 1.6406, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -4.3750, -0.4746, 3.0469, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.8438, -4.2500, -0.8633, 3.7812, -0.9570]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:56:40,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.22 | optimizer_step: 0.31 [2025-11-06 18:56:40,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.72 | bwd_microstep: 1170.75 | bwd_inner_microstep: 1.42 | bwd_allreduce_microstep: 1169.24 | step_microstep: 2.34 [2025-11-06 18:56:40,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.27 | bwd: 1171.61 | bwd_inner: 2.19 | bwd_allreduce: 1169.29 | step: 2.42 83%|████████▎ | 2906/3507 [1:11:54<15:31, 1.55s/it] {'loss': 0.1431, 'learning_rate': 
1.5018357542584461e-06, 'epoch': 0.83} 83%|████████▎ | 2906/3507 [1:11:54<15:31, 1.55s/it]tensor([[-5.1875, -5.4375, -2.1406, 2.2344, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4375, -4.3750, 0.6016, 3.0469, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:40,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.95 | bwd_microstep: 5.04 | bwd_inner_microstep: 4.81 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.16 tensor([[-4.1875, -3.3906, 0.4043, 2.7031, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0312, -4.4688, -1.1094, 3.5469, -1.0859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.2500, -2.6719, 2.7188, 0.0154, -5.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.6250, -4.6562, 0.8477, 1.5078, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.5938e+00, -5.6562e+00, -8.5938e-01, -5.7983e-04, -5.2500e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.0625, -5.9375, -0.1445, 2.7500, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:56:41,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.85 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:56:41,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.54 | bwd_microstep: 165.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 165.00 | step_microstep: 2.60 [2025-11-06 18:56:41,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 407.50 | bwd: 170.88 | bwd_inner: 5.57 | bwd_allreduce: 165.10 | step: 2.76 83%|████████▎ | 2907/3507 [1:11:54<12:43, 1.27s/it] {'loss': 0.4302, 'learning_rate': 
1.4969706159791564e-06, 'epoch': 0.83} 83%|████████▎ | 2907/3507 [1:11:54<12:43, 1.27s/it]tensor([[-7.5000, -5.6250, 0.4434, 1.8984, -4.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:41,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.62 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.1875, -2.6719, 2.7500, 0.5508, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.9375, -3.1094, 1.5625, 0.5039, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8125, -3.3438, -0.3848, 2.0156, -1.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7812, -3.0938, 0.4609, 1.0000, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9375, -3.5938, 0.5469, -0.2246, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.5625, 1.2969, 3.5625, -1.2578, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7812, -5.6250, -1.5000, 2.2344, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:56:42,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.22 | optimizer_step: 0.25 [2025-11-06 18:56:42,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.18 | bwd_microstep: 584.52 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 583.65 | step_microstep: 2.51 [2025-11-06 18:56:42,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.83 | bwd: 585.40 | bwd_inner: 1.56 | bwd_allreduce: 583.70 | step: 2.59 83%|████████▎ | 2908/3507 [1:11:55<11:46, 1.18s/it] {'loss': 0.22, 'learning_rate': 1.4921127330793138e-06, 
'epoch': 0.83} 83%|████████▎ | 2908/3507 [1:11:55<11:46, 1.18s/it]tensor([[-5.5938, -2.4531, 1.4453, -0.6211, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2500, -1.7734, 0.7695, 1.0938, -2.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.8125, -5.4375, -0.5352, 1.3672, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0000, -2.9219, 1.5547, 1.9297, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2812, -4.2188, -0.9492, 3.0000, -1.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:42,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.88 | bwd_microstep: 1.16 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.0625, 0.3555, 3.2812, 0.1494, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3750, -1.9062, 2.4062, -0.5430, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.8906, -3.7656, -0.3105, 3.3281, -1.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:56:42,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.23 | optimizer_step: 0.20 [2025-11-06 18:56:42,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 199.30 | bwd_microstep: 178.23 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 177.34 | step_microstep: 2.13 [2025-11-06 18:56:42,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.17 | bwd: 179.40 | bwd_inner: 1.84 | bwd_allreduce: 177.40 | step: 2.23 83%|████████▎ | 2909/3507 [1:11:56<10:56, 1.10s/it] {'loss': 0.4644, 'learning_rate': 1.4872621097040074e-06, 'epoch': 0.83} 
83%|████████▎ | 2909/3507 [1:11:56<10:56, 1.10s/it]tensor([[-3.4688, -1.0469, 1.8203, 0.5117, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.8438, -5.9062, -1.4141, 1.3203, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8125, -5.4062, -2.7031, 1.7891, -1.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:43,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.39 | bwd_microstep: 0.63 | bwd_inner_microstep: 0.52 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.0469, -2.5156, 2.0469, 5.6875, -0.6367]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0938, -2.3906, 2.7188, -0.4102, -5.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4688, 0.0952, 2.9844, -0.9414, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.6562, -4.3125, 0.9805, 1.1875, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6562, -1.4062, 2.0469, -0.6367, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:56:46,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.22 | optimizer_step: 0.21 [2025-11-06 18:56:46,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.89 | bwd_microstep: 3290.85 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 3289.97 | step_microstep: 2.31 [2025-11-06 18:56:46,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 266.29 | bwd: 3291.48 | bwd_inner: 1.31 | bwd_allreduce: 3290.02 | step: 2.39 83%|████████▎ | 2910/3507 [1:12:00<18:21, 1.85s/it] {'loss': 0.6865, 'learning_rate': 1.482418749992125e-06, 'epoch': 0.83} 83%|████████▎ | 
2910/3507 [1:12:00<18:21, 1.85s/it]tensor([[-4.5312, -0.1348, 3.5469, -1.6250, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.3125, -2.2812, 3.0938, -0.6992, -5.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2188, -2.6406, 1.6484, 0.8555, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:56:46,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.81 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 tensor([[-5.8125, -2.7969, 2.9375, 1.5703, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4062, -0.8477, 2.8750, -0.3555, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.9141, 1.5625, 3.9375, 0.2500, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2500, -0.8672, 4.0625, -0.6797, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2812, -2.5000, 1.6094, 0.0134, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:56:47,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:56:47,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.37 | bwd_microstep: 37.97 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 36.77 | step_microstep: 1.80 [2025-11-06 18:56:47,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.21 | bwd: 39.04 | bwd_inner: 1.98 | bwd_allreduce: 36.85 | step: 1.93 83%|████████▎ | 2911/3507 [1:12:00<14:14, 1.43s/it] {'loss': 0.4325, 'learning_rate': 1.477582658076362e-06, 'epoch': 0.83} 83%|████████▎ | 2911/3507 [1:12:00<14:14, 
1.43s/it]tensor([[-5.0938, -4.9375, -1.2578, 2.3125, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7656, -3.7812, 0.0977, 4.4375, -0.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:47,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.17 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.1875, -2.7031, 1.1406, 2.4375, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2188, -3.2656, 1.1406, 1.6797, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.3750, -4.0625, 0.9727, 0.9062, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.3750, -3.8750, 2.7500, 1.0625, -5.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2812, -0.2197, 3.4375, -1.3203, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-8.1875, -6.1875, 0.0913, 1.5312, -5.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:56:48,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 18:56:48,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.74 | bwd_microstep: 1058.43 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 1057.25 | step_microstep: 2.44 [2025-11-06 18:56:48,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.94 | bwd: 1059.30 | bwd_inner: 1.86 | bwd_allreduce: 1057.30 | step: 2.52 83%|████████▎ | 2912/3507 [1:12:02<14:21, 1.45s/it] {'loss': 0.4527, 'learning_rate': 1.4727538380832095e-06, 'epoch': 0.83} 83%|████████▎ | 2912/3507 [1:12:02<14:21, 
1.45s/it]tensor([[-6.4062, -2.6875, 2.3438, -0.6328, -5.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3438, -4.7812, -2.0000, 2.1562, -1.5547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6875, -3.0156, 0.0552, 2.6719, -1.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0000, -1.6094, 3.3438, -1.5078, -5.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.1250, 0.8242, 3.0625, 0.4883, -2.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:49,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.22 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.0938, -3.0156, 2.3281, 0.8828, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5938, -1.5859, 0.1934, -2.1562, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6562, -3.7500, 0.5664, 1.3125, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:56:50,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.16 | optimizer_step: 0.26 [2025-11-06 18:56:50,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 283.80 | bwd_microstep: 709.82 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 708.55 | step_microstep: 2.12 [2025-11-06 18:56:50,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 462.05 | bwd: 710.84 | bwd_inner: 2.08 | bwd_allreduce: 708.60 | step: 2.21 83%|████████▎ | 2913/3507 [1:12:03<14:49, 1.50s/it] {'loss': 0.4294, 'learning_rate': 1.4679322941329522e-06, 'epoch': 0.83} 83%|████████▎ | 2913/3507 [1:12:03<14:49, 1.50s/it]tensor([[-7.0312, 
-4.6562, -0.5234, -0.9023, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.1719, 1.6875, 2.0625, -1.0547, -1.6953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.3750, -4.9375, -1.0781, 2.4688, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:56:50,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.55 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.9062, 0.0471, 2.2969, -0.7383, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5938, -5.7500, -1.6484, 3.2188, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.3438, -0.2412, 3.2656, -1.2734, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.4062, -1.6094, 3.6562, 0.5508, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.9805, 3.3438, 5.3438, -0.4199, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:56:51,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:56:51,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 140.96 | bwd_microstep: 1192.97 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1191.78 | step_microstep: 1.91 [2025-11-06 18:56:51,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.52 | bwd: 1193.86 | bwd_inner: 1.88 | bwd_allreduce: 1191.82 | step: 2.00 83%|████████▎ | 2914/3507 [1:12:05<15:01, 1.52s/it] {'loss': 0.6191, 'learning_rate': 1.4631180303396742e-06, 'epoch': 0.83} 83%|████████▎ | 2914/3507 [1:12:05<15:01, 1.52s/it]tensor([[-4.9062, -1.4531, 1.9297, 
-1.1797, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -2.5312, 1.7969, 1.7031, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6250, -2.0938, 2.7344, -0.0190, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:51,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.36 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.18 tensor([[4.8438, 5.8438, 7.3438, 8.1250, 4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.2500, -4.1562, 0.0562, -1.3281, -5.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6250, -3.1250, 2.4531, 2.3438, -3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0625, -5.0312, -1.6719, 2.3750, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8438, -3.1250, 1.1328, 2.1719, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:56:52,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 18:56:52,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 218.29 | bwd_microstep: 632.61 | bwd_inner_microstep: 4.88 | bwd_allreduce_microstep: 627.64 | step_microstep: 1.63 [2025-11-06 18:56:52,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 450.69 | bwd: 633.64 | bwd_inner: 5.83 | bwd_allreduce: 627.67 | step: 1.80 83%|████████▎ | 2915/3507 [1:12:06<13:49, 1.40s/it] {'loss': 0.5553, 'learning_rate': 1.4583110508112396e-06, 'epoch': 0.83} 83%|████████▎ | 2915/3507 [1:12:06<13:49, 1.40s/it]tensor([[-1.6016, 2.3438, 3.9688, -0.8828, -2.5312]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:53,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.77 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.8750, -4.5625, -0.8711, 2.4688, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5781, -0.3047, 2.2969, -0.3008, -3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1562, -1.7422, 2.0625, 1.4141, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.2344, 2.2344, 3.3750, -0.6250, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.2969, 1.4453, 2.7344, -1.7656, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4375, -1.4922, 2.1250, -2.0156, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2188, -4.1875, -1.0547, 2.4219, -1.7266]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:56:55,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:56:55,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.17 | bwd_microstep: 1783.59 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1782.49 | step_microstep: 1.65 [2025-11-06 18:56:55,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.97 | bwd: 1784.44 | bwd_inner: 1.78 | bwd_allreduce: 1782.53 | step: 1.73 83%|████████▎ | 2916/3507 [1:12:08<16:02, 1.63s/it] {'loss': 0.1841, 'learning_rate': 1.4535113596492977e-06, 'epoch': 0.83} 83%|████████▎ | 2916/3507 [1:12:08<16:02, 1.63s/it]tensor([[-7.8438, -6.8438, -1.4609, 1.6250, -4.5938]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2188, 0.6797, 3.0938, -1.4219, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:55,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.43 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.2812, 1.2891, 1.2578, -3.0781, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.3125, -1.7109, 2.6406, 1.6641, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2188, -0.6250, 3.8125, -1.9609, -5.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[5.0312, 6.7500, 6.9375, 5.9062, 4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.4375, 1.9297, 2.4531, -3.6094, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.2812, -1.6719, 2.3594, -0.6758, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:56:55,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.01 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:56:55,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.19 | bwd_microstep: 31.22 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 30.21 | step_microstep: 2.82 [2025-11-06 18:56:55,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 309.65 | bwd: 31.96 | bwd_inner: 1.59 | bwd_allreduce: 30.23 | step: 2.90 83%|████████▎ | 2917/3507 [1:12:09<12:18, 1.25s/it] {'loss': 0.3909, 'learning_rate': 1.4487189609492802e-06, 'epoch': 0.83} 83%|████████▎ | 2917/3507 [1:12:09<12:18, 1.25s/it]tensor([[-4.0625, -3.6250, -0.5117, 2.2031, -1.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') tensor([[-4.8438, -1.4141, 1.9453, -1.6016, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -3.9219, 0.5586, 2.9844, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.9062, -3.9375, 2.3438, 1.4531, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:56:55,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.53 | bwd_microstep: 1.14 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-6.1875, -3.5469, 1.7734, 1.0938, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.4062, -4.3438, 0.7266, 1.3750, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5000e+00, -4.5000e+00, -2.5940e-04, 2.3594e+00, -3.0781e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0938, 0.2754, 3.0469, -2.1406, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:56:58,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.17 | optimizer_step: 0.15 [2025-11-06 18:56:58,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.73 | bwd_microstep: 2165.15 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 2163.95 | step_microstep: 1.89 [2025-11-06 18:56:58,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 486.30 | bwd: 2166.28 | bwd_inner: 2.14 | bwd_allreduce: 2164.00 | step: 1.98 83%|████████▎ | 2918/3507 [1:12:11<16:33, 1.69s/it] {'loss': 0.3638, 'learning_rate': 1.4439338588004005e-06, 'epoch': 0.83} 83%|████████▎ | 2918/3507 [1:12:11<16:33, 1.69s/it]tensor([[-3.7344, -2.3594, 1.2344, 2.3750, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:3')
tensor([[-5.8750, -4.5938, 0.1738, 2.1875, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:56:58,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.78 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.5938, -3.9062, -0.3730, 1.9766, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2500, -4.7500, -2.0938, 2.0781, -1.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.7188, -3.0781, 1.5156, 0.6406, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.3906, 0.2148, 2.8125, -1.1172, -3.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.8438, -5.0625, 0.7617, 2.1875, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.8438, -2.0625, 2.9375, -0.3418, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:56:58,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.15 | optimizer_step: 0.19
[2025-11-06 18:56:58,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.17 | bwd_microstep: 49.89 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 48.60 | step_microstep: 1.70
[2025-11-06 18:56:58,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 450.92 | bwd: 50.80 | bwd_inner: 2.05 | bwd_allreduce: 48.62 | step: 1.77
83%|████████▎ | 2919/3507 [1:12:12<13:09, 1.34s/it] {'loss': 0.4135, 'learning_rate': 1.4391560572856412e-06, 'epoch': 0.83}
tensor([[-3.0938, -2.4375, 1.1250, 3.8594, -1.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.7031, 1.1484, 3.1094, -1.6016, -3.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.2812, -2.3281, 1.6953, 2.1250, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)
tensor([[-4.3750, -1.1641, 3.3438, 0.8477, -3.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([3], device='cuda:0')
tensor([[-5.9375, -3.0000, 2.2188, 0.5664, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:56:58,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.90 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.5625, -4.9688, -0.6797, 2.4688, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.7656, -0.5664, 2.8281, 6.1562, 0.9805]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.7500, -0.4473, 3.4531, -1.1953, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:57:00,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.27 | optimizer_step: 0.38
[2025-11-06 18:57:00,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.85 | bwd_microstep: 1417.01 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 1416.13 | step_microstep: 3.05
[2025-11-06 18:57:00,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.78 | bwd: 1417.72 | bwd_inner: 1.38 | bwd_allreduce: 1416.19 | step: 3.13
83%|████████▎ | 2920/3507 [1:12:14<14:37, 1.49s/it] {'loss': 1.0027, 'learning_rate': 1.434385560481758e-06, 'epoch': 0.83}
tensor([[-5.5625, -3.9844, 0.1006, 1.1406, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:57:00,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.74 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.2500, -1.5938, 2.3438, -0.8281, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1562, -2.1406, 2.2656, 0.4199, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.0938, -2.0312, 1.3828, 2.9844, -1.4922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.1250, -2.1094, -2.2344, 1.1953, 0.7422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-5.7812, -3.9219, 1.3984, 2.5000, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6406, -1.9219, 2.0938, 3.1875, -2.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.0000, -2.0312, 1.6562, -0.3965, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:57:00,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.10 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:57:00,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.94 | bwd_microstep: 60.18 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 59.13 | step_microstep: 2.85
[2025-11-06 18:57:00,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.70 | bwd: 61.26 | bwd_inner: 1.96 | bwd_allreduce: 59.17 | step: 2.93
83%|████████▎ | 2921/3507 [1:12:14<11:29, 1.18s/it] {'loss': 0.3135, 'learning_rate': 1.4296223724592662e-06, 'epoch': 0.83}
tensor([[-7.2812, -3.9219, 1.8828, 0.0737, -5.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:01,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.77 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-6.0312, -4.5312, 0.0530, 1.6172, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2812, -0.9648, 2.2812, -0.6523, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.7266, 1.9297, 3.6875, -0.5508, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.0312, -4.7812, -0.2383, 1.6875, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2500, 1.1016, 4.8438, -0.1914, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.6172, 2.2344, 2.6406, -2.7344, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.2812, -4.2188, -0.3281, 1.6719, -3.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:57:04,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:57:04,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.09 | bwd_microstep: 3463.47 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 3462.61 | step_microstep: 1.59
[2025-11-06 18:57:04,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.88 | bwd: 3464.33 | bwd_inner: 1.57 | bwd_allreduce: 3462.65 | step: 1.66
83%|████████▎ | 2922/3507 [1:12:18<19:13, 1.97s/it] {'loss': 0.4216, 'learning_rate': 1.4248664972824578e-06, 'epoch': 0.83}
tensor([[-4.7500, -3.7031, 0.3535, 2.2500, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:57:04,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.73 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.3438, -5.0000, 0.3984, 2.7500, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.2812, -4.7188, 0.1992, 1.8516, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.1875, -3.9531, 2.2031, 0.8477, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.1250, -5.7188, -1.5156, 1.9844, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.1250, 1.3281, 1.0391, -1.2578, -1.3984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-4.5938, -4.0312, 0.0320, 3.3125, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.8750, -2.6406, 2.7188, 0.6758, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:57:05,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:57:05,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.07 | bwd_microstep: 110.01 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 109.19 | step_microstep: 1.72
[2025-11-06 18:57:05,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.83 | bwd: 110.93 | bwd_inner: 1.55 | bwd_allreduce: 109.23 | step: 1.80
83%|████████▎ | 2923/3507 [1:12:19<14:48, 1.52s/it] {'loss': 0.1923, 'learning_rate': 1.4201179390093766e-06, 'epoch': 0.83}
tensor([[-1.8828, -2.7656, -2.2500, 1.3203, 0.1904]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-3.0625, -3.7812, -2.0312, 2.3438, -0.4922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.3906, -2.9531, 0.1963, 2.7812, -1.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:57:05,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.09 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-4.5625, -2.7188, 0.7695, 1.0391, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4375, -1.3672, 1.5859, -3.0156, -5.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4688, -3.5625, 1.2266, 1.8438, -3.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.2500, -4.7812, -0.8359, 2.6094, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.4375, -3.0625, 2.2031, -0.0491, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:57:05,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:57:05,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.53 | bwd_microstep: 62.84 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 61.80 | step_microstep: 1.53
[2025-11-06 18:57:05,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.67 | bwd: 63.77 | bwd_inner: 1.73 | bwd_allreduce: 61.86 | step: 1.64
83%|████████▎ | 2924/3507 [1:12:19<11:39, 1.20s/it] {'loss': 0.4358, 'learning_rate': 1.415376701691823e-06, 'epoch': 0.83}
tensor([[-1.2578, 2.5781, 3.0312, -2.3750, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-6.6562, -4.9688, -0.3496, 0.7070, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9688, -1.5000, 1.9141, 0.5781, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.6875, 1.1641, 2.1094, -2.7812, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:57:05,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.94 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.6562, -4.5000, 0.0089, 2.2656, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.2969, -3.2344, -2.5312, 1.4609, 0.0130]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-5.3438, -3.4844, 1.5391, 2.5156, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.0000, -4.9375, 0.7656, 1.5703, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:06,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:57:06,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.69 | bwd_microstep: 1.72 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.80 | step_microstep: 1.94
[2025-11-06 18:57:06,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 455.66 | bwd: 2.50 | bwd_inner: 1.50 | bwd_allreduce: 0.84 | step: 2.03
83%|████████▎ | 2925/3507 [1:12:19<09:41, 1.00it/s] {'loss': 0.8978, 'learning_rate': 1.4106427893753537e-06, 'epoch': 0.83}
tensor([[-5.6875, -3.7656, 1.4766, 2.3125, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3438, -4.4062, 0.5664, 3.2344, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8438, -2.1719, 1.8125, 2.5312, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.9062, -2.0312, 2.2969, 0.9297, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:06,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.81 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-4.1875, -3.5312, -0.2275, 2.0156, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3125, -4.1875, -0.0118, 1.9766, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.9688, -3.1719, 2.3750, -0.7812, -5.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5625, -3.8594, -0.8984, 1.1172, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:57:08,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.34 | optimizer_step: 0.44
[2025-11-06 18:57:08,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.23 | bwd_microstep: 2301.96 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 2300.84 | step_microstep: 3.59
[2025-11-06 18:57:08,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 407.07 | bwd: 2302.87 | bwd_inner: 1.74 | bwd_allreduce: 2300.93 | step: 3.69
83%|████████▎ | 2926/3507 [1:12:22<14:47, 1.53s/it] {'loss': 0.9287, 'learning_rate': 1.4059162060992736e-06, 'epoch': 0.83}
tensor([[-6.6562, -3.8281, 1.7656, 0.8906, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.7969, -3.6875, -0.6445, 2.8438, -1.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0938, -3.1406, 0.3320, 0.8398, -3.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.9062, -5.9688, 0.2031, 1.6641, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:09,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.26 | bwd_microstep: 1.30 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.6250, -1.1797, 3.4531, 0.6328, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6875, -2.9375, 1.1953, 2.0469, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0000, -1.0156, 2.8125, -1.2031, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.9375, -2.9688, 1.0938, -0.8438, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:09,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:57:09,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.63 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.16
[2025-11-06 18:57:09,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 424.91 | bwd: 3.32 | bwd_inner: 2.31 | bwd_allreduce: 0.90 | step: 2.23
83%|████████▎ | 2927/3507 [1:12:23<12:13, 1.27s/it] {'loss': 0.6779, 'learning_rate': 1.4011969558966332e-06, 'epoch': 0.83}
tensor([[-5.0000, -4.3750, 0.0422, 3.5156, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.2344, -1.8203, 1.1875, 3.7188, -0.4648]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.4844, -1.7812, 0.7266, 0.8633, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1562, -2.6875, 1.6094, 1.0859, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:09,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.37 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.0156, 0.7344, 3.1406, -1.1641, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.7500, 1.3516, 2.8125, -2.7500, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.3438, -5.7812, -1.1016, 2.4844, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.9062, -3.6406, -2.0469, 2.2500, -0.4121]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
[2025-11-06 18:57:11,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.19 | optimizer_step: 0.21
[2025-11-06 18:57:11,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.98 | bwd_microstep: 1879.16 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1877.96 | step_microstep: 2.18
[2025-11-06 18:57:11,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.38 | bwd: 1880.23 | bwd_inner: 2.09 | bwd_allreduce: 1878.01 | step: 2.26
83%|████████▎ | 2928/3507 [1:12:25<15:10, 1.57s/it] {'loss': 0.5497, 'learning_rate': 1.396485042794229e-06, 'epoch': 0.83}
tensor([[-3.3438, 0.6406, 2.9062, -1.9922, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.2500, -2.6875, 1.9062, 3.1406, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4688, -1.1719, 3.8906, -0.5664, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:12,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.29 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-9.2500, -5.9375, 0.3906, -0.9453, -7.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.9375, -0.9648, 3.1875, -0.8633, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0938, -4.2812, -1.4844, 2.3438, -1.4609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.2656, 0.3320, 3.5781, 1.5859, -1.9766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1562, -4.6250, -1.3984, 3.1406, -1.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:57:13,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:57:13,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.19 | bwd_microstep: 2.18 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 0.83 | step_microstep: 1.81
[2025-11-06 18:57:13,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 419.49 | bwd: 3.05 | bwd_inner: 2.07 | bwd_allreduce: 0.86 | step: 1.88
84%|████████▎ | 2929/3507 [1:12:27<14:28, 1.50s/it] {'loss': 0.2655, 'learning_rate': 1.3917804708125903e-06, 'epoch': 0.84}
tensor([[-5.8438, -4.1562, 0.4082, 1.3906, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.7812, -4.1250, 0.5625, 1.8516, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.7188, -3.6094, 0.7773, 1.0078, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:57:13,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.94 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.1719, -3.3438, -2.6719, 1.7266, 0.2090]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-5.0625, -1.2734, 2.6250, -1.1641, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4062, -4.9375, -2.3750, 2.0469, -1.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.3750, -2.3281, 2.9375, -0.8359, -5.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.1562, -3.2188, -0.8828, 2.4375, -0.9141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:57:14,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 18:57:14,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.73 | bwd_microstep: 842.40 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 841.15 | step_microstep: 1.66
[2025-11-06 18:57:14,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 438.70 | bwd: 843.31 | bwd_inner: 1.98 | bwd_allreduce: 841.19 | step: 1.74
84%|████████▎ | 2930/3507 [1:12:28<13:55, 1.45s/it] {'loss': 0.3886, 'learning_rate': 1.387083243965992e-06, 'epoch': 0.84}
tensor([[-3.1406, -2.9062, 0.3535, 3.8906, -0.7773]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.9219, 1.2578, 3.4531, -1.5625, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.8125, -5.4062, -2.1094, 0.8438, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:57:14,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.01 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-5.9688, -4.8750, 0.0933, 2.5312, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7656, -3.4375, 0.2324, 3.4844, -1.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.8438, -0.0991, 0.8789, -1.7500, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.4375, -5.4375, -1.5547, 2.3750, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5312, -2.3594, 1.4922, 1.2891, -3.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:57:15,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:57:15,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.75 | bwd_microstep: 433.82 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 432.77 | step_microstep: 1.73
[2025-11-06 18:57:15,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.79 | bwd: 434.67 | bwd_inner: 1.68 | bwd_allreduce: 432.83 | step: 1.83
84%|████████▎ | 2931/3507 [1:12:29<11:59, 1.25s/it] {'loss': 0.5727, 'learning_rate': 1.3823933662624379e-06, 'epoch': 0.84}
tensor([[-3.6250, 0.5625, 2.1406, -2.9688, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:15,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 87.96 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.9062, -4.1562, -0.5703, 1.9141, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.9062, -5.9375, -1.4766, 0.9766, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.5859, 2.0469, 2.9688, -1.5312, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[2.0156, 5.4062, 4.9688, 0.1445, 0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-5.4688, -2.6562, 2.7812, 1.8438, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.1562, -6.4688, -1.1797, 2.5000, -3.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1875, -4.2188, -1.1562, 2.3750, -1.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:57:16,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:57:16,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.95 | bwd_microstep: 942.36 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 941.13 | step_microstep: 1.87
[2025-11-06 18:57:16,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 217.93 | bwd: 943.19 | bwd_inner: 1.88 | bwd_allreduce: 941.17 | step: 1.94
84%|████████▎ | 2932/3507 [1:12:30<11:47, 1.23s/it] {'loss': 0.6326, 'learning_rate': 1.3777108417036544e-06, 'epoch': 0.84}
tensor([[-5.1562, -1.3203, 2.2812, -1.6328, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:16,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 73.60 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.5938, -3.4062, 1.7734, 2.2969, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6562, -3.8750, 1.1797, 2.4062, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.8398, 2.1719, 2.1094, -1.7188, -1.7266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.1875, -0.9375, 3.3750, -1.1562, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8750, -3.7500, -1.1953, 1.9297, -1.5391]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.1875, -5.6875, -0.9922, 2.8438, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.1562, -3.0625, 0.8047, -1.5703, -5.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:57:17,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.20 | optimizer_step: 0.21
[2025-11-06 18:57:17,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.49 | bwd_microstep: 1152.75 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1151.63 | step_microstep: 2.34
[2025-11-06 18:57:17,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 216.10 | bwd: 1153.68 | bwd_inner: 1.88 | bwd_allreduce: 1151.67 | step: 2.41
84%|████████▎ | 2933/3507 [1:12:31<12:15, 1.28s/it] {'loss': 0.2231, 'learning_rate': 1.373035674285098e-06, 'epoch': 0.84}
tensor([[-2.9844, -3.7969, -1.2656, 3.6250, -0.2412]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9062, 0.4980, 3.2344, -2.1250, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.3438, -3.9688, 2.1875, -0.0356, -5.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:18,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.70 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-7.5938, -6.1562, 0.1104, 2.6719, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4062, -4.5312, 0.2754, 3.2031, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0312, -4.9688, -3.4219, 1.1797, -1.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.0938, -3.8438, 2.2344, 0.8945, -5.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0312, -2.9844, 2.3125, 0.7812, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:57:18,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.20 | optimizer_step: 0.31
[2025-11-06 18:57:18,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.21 | bwd_microstep: 246.06 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 244.75 | step_microstep: 2.37
[2025-11-06 18:57:18,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.95 | bwd: 247.01 | bwd_inner: 2.01 | bwd_allreduce: 244.81 | step: 2.47
84%|████████▎ | 2934/3507 [1:12:32<10:32, 1.10s/it] {'loss': 0.7899, 'learning_rate': 1.3683678679959556e-06, 'epoch': 0.84}
tensor([[-5.9688, -2.0156, 2.7031, -1.2578, -5.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.4688, -5.0312, 0.9844, 1.3438, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.3125, 0.9297, 4.6875, 2.1250, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:18,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.45 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-5.4062, -3.0938, 1.4297, 1.2031, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5938, -1.0469, 2.5469, -0.4277, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.9180, 2.5938, 1.9141, -2.4219, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-4.5000, -5.0312, -1.7031, 3.1250, -1.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6406, 0.8008, 3.5781, -1.8281, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:57:21,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.22
[2025-11-06 18:57:21,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.12 | bwd_microstep: 1482.57 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 1481.66 | step_microstep: 2.17
[2025-11-06 18:57:21,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 373.60 | bwd: 1483.63 | bwd_inner: 1.74 | bwd_allreduce: 1481.73 | step: 2.28
84%|████████▎ | 2935/3507 [1:12:35<14:54, 1.56s/it] {'loss': 0.2318, 'learning_rate': 1.3637074268191209e-06, 'epoch': 0.84}
tensor([[-5.0312, -3.8906, 0.5781, 2.8594, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.7812, -3.9375, 0.3848, 3.1406, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.8438, -4.8438, -0.4238, 2.1406, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.4844, -2.8906, -0.3379, 1.8516, -1.6016]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4062, -3.8125, -0.7500, 1.8672, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.0938, -5.5000, -1.0078, 2.4219, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.4062, -6.4688, -2.6719, 1.6406, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:57:22,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.12 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.5312, -2.4375, 2.3281, 0.5234, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:22,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.04 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:57:22,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.55 | bwd_microstep: 1.69 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.81 | step_microstep: 3.15
[2025-11-06 18:57:22,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 435.69 | bwd: 2.59 | bwd_inner: 1.60 | bwd_allreduce: 0.84 | step: 3.23
84%|████████▎ | 2936/3507 [1:12:36<14:13, 1.49s/it] {'loss': 0.6944, 'learning_rate': 1.3590543547312108e-06, 'epoch': 0.84}
tensor([[-7.2812, -4.4375, 1.7109, 1.0000, -5.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.8125, -4.5938, 0.5117, 3.2031, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -3.9688, 0.6055, 4.0938, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.8438, -4.2188, -1.5547, 2.5938, -1.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
[2025-11-06 18:57:22,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.43 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([3], device='cuda:1')
tensor([[-1.6406, 2.0469, 3.4219, -0.7578, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.1875, -3.8281, 0.7969, 2.7500, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9375, -2.8594, 1.6250, 1.9062, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3750, -4.5938, -0.2129, 2.7031, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:57:24,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 18:57:24,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.00 | bwd_microstep: 1693.45 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1692.37 | step_microstep: 1.71
[2025-11-06 18:57:24,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.49 | bwd: 1694.41 | bwd_inner: 1.86 | bwd_allreduce: 1692.41 | step: 1.80
84%|████████▎ | 2937/3507 [1:12:38<15:55, 1.68s/it] {'loss': 0.2826, 'learning_rate': 1.3544086557025493e-06, 'epoch': 0.84}
tensor([[-3.8750, 0.3262, 2.9062, -1.9297, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:24,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.06 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.7500, -2.3750, 2.0625, 1.8047, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7188, -2.2656, 1.4531, 2.5781, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.3438, -4.8750, 0.1846, 2.0469, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2812, -4.3125, -0.8906, 3.0625, -1.5859]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7812, -2.9844, 2.5000, 1.9453, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.1875, 1.4219, 1.5859, -2.8750, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.7812, -2.6094, 0.4297, 1.4531, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:57:25,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:57:25,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 56.44 | bwd_microstep: 222.85 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 221.94 | step_microstep: 1.50
[2025-11-06 18:57:25,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 189.51 | bwd: 223.67 | bwd_inner: 1.57 | bwd_allreduce: 221.98 | step: 1.57
84%|████████▍ | 2938/3507 [1:12:38<12:22, 1.30s/it] {'loss': 0.5849, 'learning_rate': 1.3497703336971746e-06, 'epoch': 0.84}
tensor([[-1.7344, 1.7344, 2.8281, -1.7109, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:25,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.03 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-5.3750, -5.5312, -2.2031, 2.1250, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.4688, -5.1562, 1.2266, 1.9141, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.8750, -4.1875, -0.4062, 2.2969, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7812, -2.8750, -0.0869, -0.1709, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.2188, -4.0938, 0.8711, 3.2656, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2500, -3.6250, 0.5938, 1.9062, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.2344, 0.5664, 2.8594, -1.7266, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:57:27,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.20
[2025-11-06 18:57:27,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.59 | bwd_microstep: 1683.62 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1682.46 | step_microstep: 2.03
[2025-11-06 18:57:27,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 290.65 | bwd: 1684.54 | bwd_inner: 1.86 | bwd_allreduce: 1682.52 | step: 2.14
84%|████████▍ | 2939/3507 [1:12:41<14:53, 1.57s/it] {'loss': 0.4211, 'learning_rate': 1.3451393926728252e-06, 'epoch': 0.84}
tensor([[-2.9219, 0.4570, 2.5312, -0.7969, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5625, -2.9375, 1.2656, 2.4375, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.8750, -4.3438, 0.7188, 0.2139, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:27,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 296.16 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 |
step_microstep: 0.08 tensor([[-6.6562, -5.2500, -1.0078, 0.6406, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7188, -3.3125, 1.1562, 2.6250, -2.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2969, -2.6250, 0.2012, 4.3125, 0.0718]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2812, -4.2500, -1.5469, 1.7031, -1.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5938, -6.2188, -3.4844, 1.1719, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:57:28,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.27 [2025-11-06 18:57:28,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.40 | bwd_microstep: 698.94 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 697.86 | step_microstep: 2.02 [2025-11-06 18:57:28,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 434.58 | bwd: 699.97 | bwd_inner: 1.92 | bwd_allreduce: 697.91 | step: 2.11 84%|████████▍ | 2940/3507 [1:12:42<13:44, 1.45s/it] {'loss': 0.5118, 'learning_rate': 1.3405158365809445e-06, 'epoch': 0.84} 84%|████████▍ | 2940/3507 [1:12:42<13:44, 1.45s/it]tensor([[-5.6875, -4.9688, -0.8711, 1.9297, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3438, -2.8281, 1.7422, 1.2734, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -3.4844, 0.6992, 3.5781, -1.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:57:28,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.34 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 
tensor([[-6.1562, -4.9062, 0.2188, 2.3906, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.3125, -3.5312, 1.2344, 2.0938, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6250, -2.9844, 0.8477, -0.2891, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5938, -5.1250, -1.7344, 1.7031, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4375, -1.7656, 2.1562, -1.2578, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:57:30,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:57:30,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.30 | bwd_microstep: 1410.67 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 1409.80 | step_microstep: 2.45 [2025-11-06 18:57:30,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.67 | bwd: 1411.66 | bwd_inner: 1.67 | bwd_allreduce: 1409.85 | step: 2.54 84%|████████▍ | 2941/3507 [1:12:44<14:41, 1.56s/it] {'loss': 0.4355, 'learning_rate': 1.3358996693666748e-06, 'epoch': 0.84} 84%|████████▍ | 2941/3507 [1:12:44<14:41, 1.56s/it]tensor([[-2.9688, 0.9766, 2.5781, -1.9766, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.5664, -1.1328, -0.1562, 3.0312, 1.0547]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -4.1562, -0.0444, 2.0000, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4375, -2.1562, 1.8047, 1.6875, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -0.3652, 3.8281, 0.2578, -3.9844]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:57:30,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.05 | bwd_microstep: 1.18 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.20 tensor([[-4.8438, -4.5625, -0.1514, 3.6406, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.0000, -4.1875, -0.8477, 3.3906, -1.2266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9375, -3.0469, 0.6406, 0.8867, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:57:31,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.21 | optimizer_step: 0.21 [2025-11-06 18:57:31,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.26 | bwd_microstep: 319.78 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 318.73 | step_microstep: 1.94 [2025-11-06 18:57:31,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 454.33 | bwd: 320.97 | bwd_inner: 2.04 | bwd_allreduce: 318.78 | step: 2.14 84%|████████▍ | 2942/3507 [1:12:45<13:32, 1.44s/it] {'loss': 0.2487, 'learning_rate': 1.3312908949688497e-06, 'epoch': 0.84} 84%|████████▍ | 2942/3507 [1:12:45<13:32, 1.44s/it]tensor([[-4.1250, -3.7656, -0.6211, 2.2500, -1.8672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:57:31,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.17 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.6562, -4.0312, -1.9531, 1.5703, -1.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-8.1250, -6.2812, 0.2695, 2.0781, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4062, -1.1172, 1.9453, -0.6523, 
-3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2812, -2.4531, 1.0156, 1.4219, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1875, -5.0312, -0.5391, 3.7812, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0312, -2.1094, 1.9219, -2.1719, -5.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5000, -3.8594, -0.3359, 2.0312, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:57:32,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:57:32,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.87 | bwd_microstep: 794.22 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 792.93 | step_microstep: 1.74 [2025-11-06 18:57:32,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.07 | bwd: 795.19 | bwd_inner: 2.08 | bwd_allreduce: 792.96 | step: 1.82 84%|████████▍ | 2943/3507 [1:12:46<12:51, 1.37s/it] {'loss': 1.0261, 'learning_rate': 1.3266895173200056e-06, 'epoch': 0.84} 84%|████████▍ | 2943/3507 [1:12:46<12:51, 1.37s/it]tensor([[-3.7656, -3.5469, -0.5508, 2.7188, -1.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.9844, 1.4766, 2.7031, -1.6641, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.7812, -5.3125, -0.4492, 1.1484, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.8125, -3.5156, 2.6094, 0.8438, -5.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:57:32,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.98 | bwd_microstep: 1.82 | 
bwd_inner_microstep: 1.67 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 tensor([[-2.5469, 0.9805, 2.6719, -1.4766, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9219, -3.8906, -1.6094, 1.2656, -1.7422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.0625, -1.9297, 2.7344, -1.8281, -5.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.1250, -5.0625, 0.8828, 2.1250, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:57:33,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.25 | optimizer_step: 0.30 [2025-11-06 18:57:33,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 109.45 | bwd_microstep: 385.57 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 384.66 | step_microstep: 2.18 [2025-11-06 18:57:33,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.44 | bwd: 387.39 | bwd_inner: 2.50 | bwd_allreduce: 384.73 | step: 2.29 84%|████████▍ | 2944/3507 [1:12:47<11:05, 1.18s/it] {'loss': 0.3785, 'learning_rate': 1.322095540346352e-06, 'epoch': 0.84} 84%|████████▍ | 2944/3507 [1:12:47<11:05, 1.18s/it]tensor([[-2.7969, -3.4844, -2.4219, 1.1875, -0.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:57:33,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 119.01 | bwd_microstep: 1.14 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 tensor([[-4.6562, -0.6758, 3.0625, -1.4141, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, -4.9688, -2.5000, 2.4531, -1.0859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.2500, -4.5625, 0.5156, 1.6797, -4.0312]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.3438, -5.9062, -0.4375, 1.7656, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.6250, -6.6250, -1.1406, 2.1094, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8750, -2.1250, 1.8516, 0.4180, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6250, -3.7344, 0.9609, 1.7969, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:57:35,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.33 | optimizer_step: 0.29 [2025-11-06 18:57:35,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 144.21 | bwd_microstep: 1386.12 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1384.88 | step_microstep: 2.50 [2025-11-06 18:57:35,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 263.23 | bwd: 1387.26 | bwd_inner: 2.10 | bwd_allreduce: 1384.96 | step: 2.61 84%|████████▍ | 2945/3507 [1:12:48<12:29, 1.33s/it] {'loss': 0.1614, 'learning_rate': 1.3175089679677922e-06, 'epoch': 0.84} 84%|████████▍ | 2945/3507 [1:12:48<12:29, 1.33s/it]tensor([[-6.3750, -4.1250, 1.2891, 1.4062, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -4.2188, 0.3887, 3.9688, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3750, -3.3750, 0.0277, 1.9141, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7812, -3.4531, 1.3828, 1.2578, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.5078, 1.7188, 3.7188, 0.4551, -1.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.5938, -4.2812, -1.7344, 
3.0000, -0.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:57:36,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 109.53 | bwd_microstep: 3.26 | bwd_inner_microstep: 3.14 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.0000, -3.5312, 0.8047, 2.0469, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8438, -4.8125, -1.2891, 2.5781, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:57:36,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.87 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:57:36,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.47 | bwd_microstep: 9.91 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 8.84 | step_microstep: 2.77 [2025-11-06 18:57:36,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 291.03 | bwd: 13.17 | bwd_inner: 4.13 | bwd_allreduce: 8.89 | step: 2.85 84%|████████▍ | 2946/3507 [1:12:50<12:23, 1.33s/it] {'loss': 0.4962, 'learning_rate': 1.3129298040979133e-06, 'epoch': 0.84} 84%|████████▍ | 2946/3507 [1:12:50<12:23, 1.33s/it]tensor([[-5.2812, -4.4688, 0.9102, 4.4062, -2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0000, -2.9844, 1.0859, 1.6016, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:57:36,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.21 | bwd_microstep: 2.45 | bwd_inner_microstep: 2.33 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.5312, -0.2754, 4.0625, -0.3984, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0000, -4.3750, -0.5117, 2.1562, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:1') tensor([[-7.1875, -3.5938, 2.4844, 0.1074, -5.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.7188, -1.7891, 1.7500, -0.3691, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.8125, -3.7500, 1.1797, -0.5625, -5.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8125, -4.4062, 0.0693, 1.8906, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:57:37,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:57:37,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 300.19 | bwd_microstep: 1052.95 | bwd_inner_microstep: 1.47 | bwd_allreduce_microstep: 1051.36 | step_microstep: 1.76 [2025-11-06 18:57:37,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 485.44 | bwd: 1055.39 | bwd_inner: 3.81 | bwd_allreduce: 1051.40 | step: 1.85 84%|████████▍ | 2947/3507 [1:12:51<13:06, 1.40s/it] {'loss': 0.746, 'learning_rate': 1.3083580526439787e-06, 'epoch': 0.84} 84%|████████▍ | 2947/3507 [1:12:51<13:06, 1.40s/it]tensor([[-5.2500, -2.2656, 1.8672, 0.0388, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0938, -5.4688, -2.2500, 2.2031, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6250, -3.5469, -0.4453, 3.1875, -1.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.3125, -4.7500, 0.8633, 2.9219, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8281, 0.8008, 2.7188, -1.3438, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5000, 0.8086, 2.7812, -2.6094, -4.2188]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8438, 1.4141, 3.5625, -1.9375, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:57:39,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.29 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.1875, -3.5469, -0.1738, 2.3125, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:57:39,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.80 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 18:57:39,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 210.00 | bwd_microstep: 3.20 | bwd_inner_microstep: 2.17 | bwd_allreduce_microstep: 0.94 | step_microstep: 3.91 [2025-11-06 18:57:39,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.31 | bwd: 4.19 | bwd_inner: 3.07 | bwd_allreduce: 0.97 | step: 4.00 84%|████████▍ | 2948/3507 [1:12:53<14:37, 1.57s/it] {'loss': 0.1074, 'learning_rate': 1.303793717506927e-06, 'epoch': 0.84} 84%|████████▍ | 2948/3507 [1:12:53<14:37, 1.57s/it]tensor([[-3.2969, -0.0201, 2.2969, -0.7188, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5625, -4.8125, -1.4688, 2.7500, -1.6953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -2.0938, 1.8516, 0.4297, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:57:40,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.75 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 tensor([[-3.6875, -0.1572, 2.0000, -1.5625, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8438, -3.9844, 
0.6445, 1.6328, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.2500, -4.3750, -1.2266, 2.6562, -1.5859]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -4.4688, -0.5508, 2.4062, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -4.2500, 0.2119, 2.4062, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:57:41,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.15 | optimizer_step: 0.25 [2025-11-06 18:57:41,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.70 | bwd_microstep: 849.54 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 848.19 | step_microstep: 1.77 [2025-11-06 18:57:41,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.48 | bwd: 850.64 | bwd_inner: 2.19 | bwd_allreduce: 848.25 | step: 1.88 84%|████████▍ | 2949/3507 [1:12:55<13:43, 1.48s/it] {'loss': 0.1346, 'learning_rate': 1.2992368025813628e-06, 'epoch': 0.84} 84%|████████▍ | 2949/3507 [1:12:55<13:43, 1.48s/it]tensor([[-4.0312, -2.7812, 0.9414, 2.4062, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6250, -4.2812, -0.1953, 1.3281, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0000, -3.1875, 0.0527, 4.3750, -0.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5938, 1.8984, 3.6406, -2.3594, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.6250, -2.0625, 3.0000, -2.0156, -6.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2344, 0.4102, 2.9062, -1.1641, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-1.9922, -3.2812, -2.7656, 1.7344, 0.3848]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:57:43,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.46 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.8125, -4.0312, 0.0737, 2.8125, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:57:43,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 18:57:43,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.20 | bwd_microstep: 1.63 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.78 | step_microstep: 2.88 [2025-11-06 18:57:43,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.67 | bwd: 2.44 | bwd_inner: 1.47 | bwd_allreduce: 0.82 | step: 2.97 84%|████████▍ | 2950/3507 [1:12:57<16:09, 1.74s/it] {'loss': 0.5185, 'learning_rate': 1.2946873117555692e-06, 'epoch': 0.84} 84%|████████▍ | 2950/3507 [1:12:57<16:09, 1.74s/it]tensor([[-5.0625, -4.1250, -0.2373, 2.0000, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:57:43,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.01 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.9375, -2.1406, 2.4219, 1.1250, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0625, -1.9688, 2.2656, 0.1328, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0938, -1.1094, 3.3281, -0.7578, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4375, -4.7188, 0.7891, 2.2500, -4.0625]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.3438, -4.3125, -0.3008, 1.7891, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.0000, -4.2812, 1.5078, 0.8398, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6875, -5.1875, -2.8438, 1.4141, -1.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:57:44,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 18:57:44,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.22 | bwd_microstep: 379.94 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 378.90 | step_microstep: 1.42 [2025-11-06 18:57:44,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.25 | bwd: 380.79 | bwd_inner: 1.72 | bwd_allreduce: 378.95 | step: 1.51 84%|████████▍ | 2951/3507 [1:12:58<13:26, 1.45s/it] {'loss': 0.6071, 'learning_rate': 1.2901452489114896e-06, 'epoch': 0.84} 84%|████████▍ | 2951/3507 [1:12:58<13:26, 1.45s/it]tensor([[-2.6406, 1.0781, 2.9375, -1.4844, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-8.7500, -4.7812, 0.8945, -2.0156, -7.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7188, -4.0938, 0.5742, 1.9922, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0312, -1.2500, 3.4219, 0.0466, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -4.3438, -0.4766, 2.2031, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.6562, -4.8750, 1.2812, 0.7461, -5.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8750, -3.4844, 0.3086, 
3.6562, -1.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:57:45,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.50 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2812, -4.6250, -1.6641, 2.6250, -1.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:57:45,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 18:57:45,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.38 | bwd_microstep: 6.11 | bwd_inner_microstep: 5.16 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.68 [2025-11-06 18:57:45,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.91 | bwd: 6.80 | bwd_inner: 5.75 | bwd_allreduce: 0.90 | step: 6.76 84%|████████▍ | 2952/3507 [1:12:59<13:13, 1.43s/it] {'loss': 0.6263, 'learning_rate': 1.2856106179247297e-06, 'epoch': 0.84} 84%|████████▍ | 2952/3507 [1:12:59<13:13, 1.43s/it]tensor([[-0.7383, -1.5391, 0.2832, 4.8125, 1.4141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:57:45,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.30 | bwd_microstep: 1.12 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.09 tensor([[-3.3125, -1.2344, 1.8594, 1.8516, -2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3438, -0.9648, 1.7656, -1.5469, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.5156, -3.2031, 0.3867, 3.7188, -1.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.9062, -3.6719, 0.7969, 2.9688, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:2')
tensor([[-2.3750, -3.3594, -2.5469, 1.3828, -0.1328]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-5.2188, -4.1250, 0.3105, 2.3594, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6250, -0.3516, 2.3750, -0.2656, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:57:47,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.24 | optimizer_step: 0.26
[2025-11-06 18:57:47,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.24 | bwd_microstep: 1711.74 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 1710.39 | step_microstep: 2.30
[2025-11-06 18:57:47,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 345.59 | bwd: 1712.86 | bwd_inner: 2.14 | bwd_allreduce: 1710.49 | step: 2.40
84%|████████▍ | 2953/3507 [1:13:01<15:04, 1.63s/it] {'loss': 0.381, 'learning_rate': 1.281083422664553e-06, 'epoch': 0.84}
tensor([[-0.2793, 2.5156, 2.9219, 0.2441, -0.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.7812, -4.2188, 0.0100, 0.9648, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.3125, -2.7812, 2.2344, -0.3906, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5938, -5.2188, -2.2656, 2.3750, -1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0000, -2.0312, 2.1094, 0.0170, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[5.4062, 7.0000, 7.4062, 6.6562, 4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-9.6250, -7.8750, -2.7031, -1.1641, -6.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:57:48,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.92 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.7812, -0.4688, 2.4062, -2.6094, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:49,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.71 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:57:49,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.89 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.37
[2025-11-06 18:57:49,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 415.84 | bwd: 2.86 | bwd_inner: 1.84 | bwd_allreduce: 0.87 | step: 2.47
84%|████████▍ | 2954/3507 [1:13:02<14:02, 1.52s/it] {'loss': 0.5439, 'learning_rate': 1.2765636669938798e-06, 'epoch': 0.84}
tensor([[-4.4375, -1.3828, 1.6406, -0.6016, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-9.5625, -6.5625, -2.1719, -3.6875, -7.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:49,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.40 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-7.0000, -4.0625, -2.0000, -4.0938, -5.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.9375, -3.9062, 0.2305, 2.0938, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.5938, -1.6875, 3.3750, -0.2100, -5.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1562, -0.5703, 2.8281, -0.4375, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-8.2500, -6.2188, -0.3691, 0.6367, -5.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.5469, -1.5156, 3.1250, 5.6875, -0.5742]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:57:50,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.75 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:57:50,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.77 | bwd_microstep: 1521.93 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 1521.04 | step_microstep: 2.57
[2025-11-06 18:57:50,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 279.19 | bwd: 1522.60 | bwd_inner: 1.38 | bwd_allreduce: 1521.08 | step: 2.65
84%|████████▍ | 2955/3507 [1:13:04<14:52, 1.62s/it] {'loss': 0.8867, 'learning_rate': 1.2720513547692804e-06, 'epoch': 0.84}
tensor([[-1.1953, 2.6406, 3.5000, -1.6484, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1562, -3.1562, 1.2578, 1.7891, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.7812, -4.9688, -0.5273, 2.2812, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:57:51,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.89 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.3281, -0.8750, 2.6094, 1.7500, -2.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5312, -4.3750, -0.9570, 2.5312, -1.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0625, -1.3359, 1.7500, -0.2012, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.3750, 1.0547, 3.4844, -2.2500, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.9688, -4.5625, -1.7031, -0.6367, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:57:51,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:57:51,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.70 | bwd_microstep: 7.86 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 6.96 | step_microstep: 1.71
[2025-11-06 18:57:51,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.61 | bwd: 8.58 | bwd_inner: 1.42 | bwd_allreduce: 7.00 | step: 1.80
84%|████████▍ | 2956/3507 [1:13:05<11:29, 1.25s/it] {'loss': 0.3748, 'learning_rate': 1.2675464898409772e-06, 'epoch': 0.84}
tensor([[-3.1094, -0.0304, 2.1719, -0.7305, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.1875, -2.3906, 3.2344, 0.0630, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.1875, -3.3125, 1.8203, -1.6094, -6.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:51,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.60 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.0938, -6.0000, -2.9688, 2.5625, -1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.1562, -2.4375, 2.8438, -0.2197, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.5000, -2.9375, 2.1719, 1.5859, -4.0000]], device='cuda:3',
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.3438, -3.6719, 2.1406, 1.7031, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8750, -2.9688, 1.3516, 1.8203, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:57:54,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.18 | optimizer_step: 0.28
[2025-11-06 18:57:54,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.18 | bwd_microstep: 2742.27 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 2741.22 | step_microstep: 2.06
[2025-11-06 18:57:54,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 384.80 | bwd: 2743.02 | bwd_inner: 1.59 | bwd_allreduce: 2741.28 | step: 2.15
84%|████████▍ | 2957/3507 [1:13:08<16:45, 1.83s/it] {'loss': 0.2928, 'learning_rate': 1.2630490760528358e-06, 'epoch': 0.84}
tensor([[-3.4531, 0.9062, 3.7344, -1.5391, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5938, -2.5312, 1.5938, 1.5938, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:54,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.82 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.4688, -5.8438, -2.7500, 2.0156, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5312, -4.7188, -1.1016, 3.1094, -1.7422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1875, -2.0000, 2.4375, 2.4531, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.6250, -5.2188, -2.4844, 2.1875, -1.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5938, -2.5000, 0.9688, -1.2734, -4.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5938, -2.4219, 3.6250, 0.0518, -5.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:57:54,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:57:54,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.19 | bwd_microstep: 43.79 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 42.52 | step_microstep: 1.46
[2025-11-06 18:57:54,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.02 | bwd: 44.70 | bwd_inner: 2.03 | bwd_allreduce: 42.55 | step: 1.54
84%|████████▍ | 2958/3507 [1:13:08<12:47, 1.40s/it] {'loss': 0.2137, 'learning_rate': 1.2585591172423606e-06, 'epoch': 0.84}
tensor([[-5.9688, -5.0938, -0.4531, 2.1250, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4688, -4.6562, -1.4375, 2.5469, -1.6953]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.1562, -4.1562, -0.9102, 2.5625, -1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:57:55,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.77 | bwd_microstep: 1.11 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.5312, -4.0000, 0.3164, 3.6406, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.3438, -5.1875, -1.0547, -0.9688, -5.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.5469, 0.0303, 2.2188, -1.4453, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5312, -3.8281, 2.1875, 1.8750, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8438, -0.1377, 3.2344, -0.6211, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:57:58,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.21
[2025-11-06 18:57:58,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.31 | bwd_microstep: 2818.48 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 2817.33 | step_microstep: 2.04
[2025-11-06 18:57:58,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 388.11 | bwd: 2819.59 | bwd_inner: 2.08 | bwd_allreduce: 2817.37 | step: 2.13
84%|████████▍ | 2959/3507 [1:13:11<17:49, 1.95s/it] {'loss': 0.2009, 'learning_rate': 1.254076617240706e-06, 'epoch': 0.84}
tensor([[-4.4062, -0.5273, 2.7812, -1.2578, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.0000, -4.6562, 0.7891, 2.9688, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5312, -3.9688, 0.7812, 2.3438, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6562, -2.0312, 2.7812, 0.0264, -4.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:57:58,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.68 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.7227, 1.2031, 1.2422, -0.2217, -0.8398]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-3.2656, -2.6250, 0.9375, 3.5625, -1.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.5000, -5.0000, 1.0547, 0.9102, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3750, -3.9375, -0.3984, 2.4531, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:57:58,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:57:58,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.20 | bwd_microstep: 1.67 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.67 | step_microstep: 1.46
[2025-11-06 18:57:58,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.91 | bwd: 2.67 | bwd_inner: 1.83 | bwd_allreduce: 0.70 | step: 1.55
84%|████████▍ | 2960/3507 [1:13:12<13:39, 1.50s/it] {'loss': 0.6304, 'learning_rate': 1.249601579872648e-06, 'epoch': 0.84}
tensor([[-7.0000, -3.3125, 2.0000, -0.8125, -5.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.7812, -3.7969, 2.0312, 1.0312, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:57:58,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.28 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.8438, -0.8047, 3.7031, -0.3477, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.2812, -5.0625, -0.5742, 1.5859, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3125, -1.9844, 2.4375, -0.1855, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.8281, -2.5781, 0.7852, 4.0000, -0.6953]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.2500, -4.3750, -1.1641,
2.8594, -1.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2188, -0.8945, 2.3438, -0.0520, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:58:01,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.21 | optimizer_step: 0.21
[2025-11-06 18:58:01,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.43 | bwd_microstep: 2166.04 | bwd_inner_microstep: 5.40 | bwd_allreduce_microstep: 2160.54 | step_microstep: 2.50
[2025-11-06 18:58:01,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.73 | bwd: 2166.84 | bwd_inner: 6.12 | bwd_allreduce: 2160.58 | step: 2.57
84%|████████▍ | 2961/3507 [1:13:14<16:29, 1.81s/it] {'loss': 0.1017, 'learning_rate': 1.2451340089566022e-06, 'epoch': 0.84}
tensor([[-4.4688, -4.7500, -2.0156, 1.9609, -1.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.9844, -4.5938, -2.0312, 2.4844, -1.1641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5312, -3.3438, 1.0781, 1.2812, -3.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3438, -3.8125, 0.7500, 2.3125, -3.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:01,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.11 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-0.6719, 2.7656, 2.3281, -2.1719, -1.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.3125, -2.2969, 1.7891, 1.7266, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2188, 0.7539, 4.0000, -2.4219, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.7812, -3.4375, 1.9219, 1.9922, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:58:01,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:58:01,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.31 | bwd_microstep: 66.05 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 64.78 | step_microstep: 1.96
[2025-11-06 18:58:01,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 417.45 | bwd: 66.97 | bwd_inner: 2.00 | bwd_allreduce: 64.82 | step: 2.03
84%|████████▍ | 2962/3507 [1:13:15<12:57, 1.43s/it] {'loss': 0.4035, 'learning_rate': 1.240673908304615e-06, 'epoch': 0.84}
tensor([[-4.9688, -2.0469, 2.6094, 1.0312, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.9688, -6.3438, -1.7109, 1.6953, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7500, -4.2500, 0.0317, 3.4688, -2.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:01,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.73 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-8.7500, -7.2500, -1.0312, 1.7109, -5.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.2188, -3.4375, 0.7812, -0.6562, -4.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.8125, -3.5625, 1.4922, 1.5156, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9688, -2.9531, 1.1641, 1.1328, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-8.0625, -4.9375, 0.7930, -0.6055, -6.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:58:02,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:58:02,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.42 | bwd_microstep: 306.40 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 305.48 | step_microstep: 1.96
[2025-11-06 18:58:02,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.18 | bwd: 307.24 | bwd_inner: 1.57 | bwd_allreduce: 305.52 | step: 2.03
84%|████████▍ | 2963/3507 [1:13:16<11:00, 1.21s/it] {'loss': 0.7977, 'learning_rate': 1.2362212817223562e-06, 'epoch': 0.84}
tensor([[-3.1094, 0.7773, 2.4688, -2.4219, -3.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0938, -4.1562, 0.0767, 0.6016, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:58:02,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.32 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-4.3125, -1.2188, 2.2031, 0.0623, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4062, -4.8750, -1.6641, 3.0156, -1.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.3496, 3.2969, 2.5781, -2.3750, -1.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-3.2188, 0.6758, 2.5156, -2.2656, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6719, -1.5625, 1.9766, 1.7109, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.6250e+00, -2.7188e+00, 1.7422e+00, -4.1504e-03, -4.5625e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:58:02,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.22 | optimizer_step: 0.21
[2025-11-06 18:58:02,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 112.11 | bwd_microstep: 114.97 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 113.71 | step_microstep: 2.05
[2025-11-06 18:58:02,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 258.43 | bwd: 115.90 | bwd_inner: 1.95 | bwd_allreduce: 113.77 | step: 2.15
85%|████████▍ | 2964/3507 [1:13:16<08:49, 1.03it/s] {'loss': 0.6039, 'learning_rate': 1.2317761330091172e-06, 'epoch': 0.85}
tensor([[-3.5625, 0.5781, 3.5469, -1.3359, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.0000, -4.9375, -0.0266, 2.6875, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0625, -2.7188, 1.7891, 1.3359, -3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.3750, -5.7188, 0.4883, 2.5312, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.3750, -4.1875, 0.8359, 1.0000, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6875, -4.8750, -1.0703, 1.3984, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3125, -2.2344, 2.6094, 0.9492, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:04,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.12 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.88 |
bwd_allreduce_microstep: 0.04 | step_microstep: 0.11
tensor([[-6.8438, -3.9375, 0.3340, -1.0703, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:58:04,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.16 | optimizer_step: 0.19
[2025-11-06 18:58:04,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.42 | bwd_microstep: 2.08 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.99 | step_microstep: 2.04
[2025-11-06 18:58:04,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 449.58 | bwd: 3.08 | bwd_inner: 1.91 | bwd_allreduce: 1.03 | step: 2.15
85%|████████▍ | 2965/3507 [1:13:18<11:13, 1.24s/it] {'loss': 0.4996, 'learning_rate': 1.2273384659578092e-06, 'epoch': 0.85}
tensor([[-7.1562, -4.3750, 1.9141, 1.4766, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.0312, -4.5312, 0.5000, 2.3594, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0312, -0.6094, 2.6719, -0.4707, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1250, -4.8750, -1.2578, 2.3281, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.0312, -3.7344, 0.7930, -1.1094, -5.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.4375, -5.1875, -0.5938, 1.8516, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:05,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.73 | bwd_microstep: 5.90 | bwd_inner_microstep: 5.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.2188, -5.0938, -1.3359, 2.5469, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.8750, -4.9375, -1.4766, 2.8594, -1.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 18:58:05,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 18:58:05,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.35 | bwd_microstep: 36.40 | bwd_inner_microstep: 5.62 | bwd_allreduce_microstep: 30.69 | step_microstep: 2.11
[2025-11-06 18:58:05,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 407.12 | bwd: 42.30 | bwd_inner: 11.43 | bwd_allreduce: 30.73 | step: 2.19
85%|████████▍ | 2966/3507 [1:13:19<10:49, 1.20s/it] {'loss': 0.1319, 'learning_rate': 1.2229082843549622e-06, 'epoch': 0.85}
tensor([[-2.0000, 1.1406, 1.7031, -1.9219, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-6.4375, -5.3125, -1.2891, 0.3945, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9375, -3.5156, 0.7109, 2.2812, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.9688, -0.6367, 2.2344, -0.6641, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3438, -3.6875, -0.0850, 2.7031, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3438, -3.4219, 0.3398, 0.4883, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:58:06,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.18 | bwd_microstep: 5.27 | bwd_inner_microstep: 5.15 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.22
tensor([[-6.8125, -3.6250, 1.5547, -0.3301, -5.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.4062, -3.2969, 1.5000, 1.6719, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:07,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:58:07,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.29 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.16
[2025-11-06 18:58:07,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.49 | bwd: 7.08 | bwd_inner: 6.03 | bwd_allreduce: 0.90 | step: 2.38
85%|████████▍ | 2967/3507 [1:13:20<11:24, 1.27s/it] {'loss': 0.6316, 'learning_rate': 1.2184855919807149e-06, 'epoch': 0.85}
tensor([[-5.0312, -2.9844, 1.0547, 1.1797, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6562, -3.3438, 0.6445, 2.0156, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9062, -5.0938, -1.3984, 2.9219, -1.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.3594, 1.5859, 4.6562, -2.3125, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:58:07,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.21 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.1562, -3.9688, -0.0894, 3.4688, -1.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3125, -1.4688, 3.6875, 0.4473, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.0156, 1.6172, 2.2812, -1.9141, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.8594, -4.2812, -1.6797, 2.5938, -1.1328]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:58:09,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.85 | optimizer_gradients: 0.21 | optimizer_step: 0.18
[2025-11-06 18:58:09,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.64 | bwd_microstep: 1796.36 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 1795.04 | step_microstep: 3.34
[2025-11-06 18:58:09,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.87 | bwd: 1797.04 | bwd_inner: 1.78 | bwd_allreduce: 1795.07 | step: 3.42
85%|████████▍ | 2968/3507 [1:13:23<14:04, 1.57s/it] {'loss': 0.4433, 'learning_rate': 1.2140703926088182e-06, 'epoch': 0.85}
tensor([[-5.7500, -4.4375, 0.2852, 2.2812, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5625, -0.9727, 3.4531, -2.0625, -5.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:58:09,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.41 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-7.4688, -4.7500, 1.1406, 0.6055, -5.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.7812, -6.2188, -0.2520, 1.8828, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.3750, -4.5000, 0.8711, 1.8203, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.7500, -3.9375, -0.4082, 1.9375, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8594, 0.3633, 3.1875, -1.7031, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.8438, 1.2578, 3.5312, -1.2734, -3.4531]],
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:58:10,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.31 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 18:58:10,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.99 | bwd_microstep: 1136.19 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1135.07 | step_microstep: 3.63 [2025-11-06 18:58:10,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.43 | bwd: 1137.07 | bwd_inner: 1.76 | bwd_allreduce: 1135.13 | step: 3.74 85%|████████▍ | 2969/3507 [1:13:24<13:57, 1.56s/it] {'loss': 0.2338, 'learning_rate': 1.20966269000663e-06, 'epoch': 0.85} 85%|████████▍ | 2969/3507 [1:13:24<13:57, 1.56s/it]tensor([[-4.9688, -3.6562, 0.1475, 1.4609, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.3125, -3.5469, 2.2344, 1.5469, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1562, -0.2285, 2.1719, -2.3125, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:58:11,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.66 | bwd_microstep: 8.22 | bwd_inner_microstep: 8.09 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.5938, -1.1562, 2.7500, -0.3379, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7812, -2.1719, 2.4531, -0.4492, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.3750, -4.4062, 1.8828, 1.0312, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2969, 0.6562, 3.1719, -1.2344, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.7500, -3.9531, -0.0288, 2.4219, -2.4375]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:58:11,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:58:11,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.50 | bwd_microstep: 225.57 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 224.68 | step_microstep: 1.93 [2025-11-06 18:58:11,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.17 | bwd: 233.79 | bwd_inner: 8.90 | bwd_allreduce: 224.74 | step: 2.02 85%|████████▍ | 2970/3507 [1:13:25<11:22, 1.27s/it] {'loss': 0.6004, 'learning_rate': 1.2052624879351105e-06, 'epoch': 0.85} 85%|████████▍ | 2970/3507 [1:13:25<11:22, 1.27s/it]tensor([[-5.6875, -3.8125, 0.6250, 1.3750, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.3438, -2.4219, -2.9375, 0.5078, 0.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:58:11,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.62 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.8125, -2.6719, 2.0312, -1.9688, -6.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.5000, 1.4062, 3.0312, -1.7188, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8906, -4.3125, -1.2266, 3.1719, -1.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4375, -0.6602, 2.5000, -1.3594, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4062, -1.1875, 3.6562, -0.8086, -5.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8750, -4.6250, 0.4141, 2.7500, -3.2969]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:58:14,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 18:58:14,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.80 | bwd_microstep: 1067.04 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 1066.11 | step_microstep: 2.16 [2025-11-06 18:58:14,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 262.43 | bwd: 1067.73 | bwd_inner: 1.42 | bwd_allreduce: 1066.16 | step: 2.24 85%|████████▍ | 2971/3507 [1:13:28<16:16, 1.82s/it] {'loss': 0.2005, 'learning_rate': 1.2008697901488187e-06, 'epoch': 0.85} 85%|████████▍ | 2971/3507 [1:13:28<16:16, 1.82s/it]tensor([[-5.6875, -3.7969, 0.9609, 1.3672, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4375, -5.2188, -2.4531, 2.5781, -1.3516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2188, -6.0312, -2.0312, 1.9844, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:58:14,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.84 | bwd_microstep: 3.79 | bwd_inner_microstep: 3.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.3594, 1.0703, 2.7969, -0.8008, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8750, -2.1562, 1.8047, 0.1846, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.9375, -4.0938, -0.2109, 2.3750, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8125, -0.2832, 1.7891, 0.0693, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3438, -5.1250, -1.3828, 1.9766, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:3')
[2025-11-06 18:58:16,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.30
[2025-11-06 18:58:16,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.16 | bwd_microstep: 923.01 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 922.10 | step_microstep: 2.32
[2025-11-06 18:58:16,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.03 | bwd: 926.80 | bwd_inner: 4.51 | bwd_allreduce: 922.14 | step: 2.40
85%|████████▍ | 2972/3507 [1:13:29<14:58, 1.68s/it] {'loss': 0.4373, 'learning_rate': 1.1964846003959118e-06, 'epoch': 0.85}
tensor([[-5.3438, -1.8984, 2.4375, -0.4199, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2500, -4.5938, -0.8438, 2.0312, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:16,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.96 | bwd_microstep: 1.12 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.7500, -4.9688, 0.7734, 2.5625, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.7812, 0.1318, 3.3438, 1.3438, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.6250, -5.2188, -1.9844, 1.2734, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4375, -4.1250, -0.1807, 3.1562, -1.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.3750, -0.2188, 2.4688, 0.1484, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0938, -3.2188, 1.8828, 2.5469, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06
18:58:17,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 18:58:17,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.94 | bwd_microstep: 515.01 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 513.79 | step_microstep: 1.81
[2025-11-06 18:58:17,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.92 | bwd: 516.13 | bwd_inner: 2.12 | bwd_allreduce: 513.84 | step: 1.91
85%|████████▍ | 2973/3507 [1:13:31<14:32, 1.63s/it] {'loss': 0.5428, 'learning_rate': 1.1921069224181413e-06, 'epoch': 0.85}
tensor([[-3.9375, -0.9414, 2.2500, -0.1836, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.1562, -5.0000, 1.0156, 1.7500, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:17,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 286.01 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.4375, -5.3438, -1.0312, 3.1094, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8438, -1.5547, 2.0469, -0.2891, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.9375, -0.3105, 3.7812, -1.7969, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5312, -0.8398, 2.2188, -1.3984, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.5000, -0.9258, 3.2188, -1.9844, -5.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.7031, -0.3184, 2.5156, -0.7266, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:58:19,538] [INFO]
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.27 | optimizer_step: 0.27
[2025-11-06 18:58:19,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.34 | bwd_microstep: 1460.08 | bwd_inner_microstep: 8.35 | bwd_allreduce_microstep: 1451.62 | step_microstep: 2.58
[2025-11-06 18:58:19,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 481.37 | bwd: 1461.07 | bwd_inner: 9.20 | bwd_allreduce: 1451.68 | step: 2.67
85%|████████▍ | 2974/3507 [1:13:33<15:27, 1.74s/it] {'loss': 0.1092, 'learning_rate': 1.187736759950846e-06, 'epoch': 0.85}
tensor([[-4.1250, -5.1250, -3.2500, 1.6641, -1.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:19,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.74 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-5.2812, -4.2812, 0.8984, 3.9219, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5625, -3.9531, -1.2578, 3.0781, -0.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.0000, -5.0625, 0.6367, 1.8359, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5938, -0.0854, 3.4844, -1.8828, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2500, -1.4219, 3.5938, 0.1235, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6562, -4.4375, -1.0469, 2.3281, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.8125, -2.2812, 1.7266, 2.8438, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:20,867] [INFO]
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:58:20,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.27 | bwd_microstep: 1.70 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.49
[2025-11-06 18:58:20,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.03 | bwd: 2.69 | bwd_inner: 1.69 | bwd_allreduce: 0.86 | step: 2.59
85%|████████▍ | 2975/3507 [1:13:34<14:20, 1.62s/it] {'loss': 0.1035, 'learning_rate': 1.1833741167229584e-06, 'epoch': 0.85}
tensor([[-4.9375, -5.2500, -1.8750, 2.6875, -1.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0938, -2.6094, 1.3125, 0.4688, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:21,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.09 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.7812, -4.2812, -1.6562, 2.6875, -1.0703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8750, -2.0469, 0.9492, -1.3594, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.5312, -5.3438, 0.3926, 2.9531, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.6250, 0.8906, 2.6094, -1.2344, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.4688, -3.6094, 1.9297, 3.0781, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5625, -5.2188, -0.3320, 1.6641, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:58:22,792] [INFO] [logging.py:128:log_dist] [Rank 0]
time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 18:58:22,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.29 | bwd_microstep: 1546.29 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 1545.34 | step_microstep: 1.89
[2025-11-06 18:58:22,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.41 | bwd: 1546.97 | bwd_inner: 1.44 | bwd_allreduce: 1545.38 | step: 1.99
85%|████████▍ | 2976/3507 [1:13:36<15:07, 1.71s/it] {'loss': 0.621, 'learning_rate': 1.1790189964569899e-06, 'epoch': 0.85}
tensor([[-5.6562, -2.2344, 2.0312, -1.0078, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4375, -4.6562, 0.3906, 3.7188, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6875, -2.6875, 0.7930, 0.4414, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.2188, -4.3750, -0.2061, 2.4844, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:23,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.41 | bwd_microstep: 0.65 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.2500, -5.1562, -1.3750, 2.6875, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.5625, -3.4844, 0.6562, -1.1953, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4375, -1.2500, 2.9375, -1.5156, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0312, -5.1250, -2.5625, 0.9102, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:24,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) |
optimizer_allgather: 0.42 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:58:24,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.36 | bwd_microstep: 2.28 | bwd_inner_microstep: 1.35 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.10
[2025-11-06 18:58:24,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.80 | bwd: 2.93 | bwd_inner: 1.92 | bwd_allreduce: 0.89 | step: 2.18
85%|████████▍ | 2977/3507 [1:13:38<15:22, 1.74s/it] {'loss': 0.5422, 'learning_rate': 1.1746714028690287e-06, 'epoch': 0.85}
tensor([[-3.6250, -1.2891, 1.7656, 0.8633, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.8125, -6.0625, -2.1406, 2.4375, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:24,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.69 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.4375, -4.5938, -0.9531, 1.4375, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.0000, -5.5312, -0.3691, 1.4844, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1875, -0.5508, 4.2500, -1.0312, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.8047, 1.1328, 2.3281, -0.5547, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.7188, 0.1895, 4.5938, -1.5859, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[0.2490, 0.6719, 2.8750, 5.2500, 1.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:58:25,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 |
optimizer_gradients: 0.19 | optimizer_step: 0.21
[2025-11-06 18:58:25,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.81 | bwd_microstep: 358.15 | bwd_inner_microstep: 1.40 | bwd_allreduce_microstep: 356.65 | step_microstep: 1.92
[2025-11-06 18:58:25,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.51 | bwd: 359.06 | bwd_inner: 2.21 | bwd_allreduce: 356.71 | step: 2.01
85%|████████▍ | 2978/3507 [1:13:39<12:47, 1.45s/it] {'loss': 0.3729, 'learning_rate': 1.1703313396687521e-06, 'epoch': 0.85}
tensor([[-6.2812, -5.0312, 0.2109, 2.7344, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:25,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.13 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.5625, -2.9062, 1.3828, 0.3340, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.5312, -4.8750, -0.4160, 2.7812, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1562, -0.9805, 2.5156, -0.1816, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.1562, -5.5000, -0.1445, 1.3125, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.5938, -6.0000, -0.3398, 1.8281, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9375, -2.8281, 1.2500, 0.8750, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.9375, -5.7188, 0.4121, 1.1953, -5.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:27,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.17 |
optimizer_step: 0.24
[2025-11-06 18:58:27,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 273.43 | bwd_microstep: 3.26 | bwd_inner_microstep: 2.14 | bwd_allreduce_microstep: 1.01 | step_microstep: 6.65
[2025-11-06 18:58:27,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 446.57 | bwd: 4.08 | bwd_inner: 2.86 | bwd_allreduce: 1.06 | step: 6.74
85%|████████▍ | 2979/3507 [1:13:41<14:46, 1.68s/it] {'loss': 0.2218, 'learning_rate': 1.1659988105594022e-06, 'epoch': 0.85}
tensor([[-6.2500, -5.3750, 0.0107, 3.2812, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.4062, -2.0156, 2.1406, -0.5547, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:58:27,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.68 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.3125, -4.3438, 0.1641, 2.7188, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.4375, -4.9688, 0.1650, 2.0938, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3438, -3.0625, 2.0156, 1.9453, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.6250, -2.7500, 2.7812, -0.3652, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.4375, -1.2031, 2.7031, 0.1235, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9062, -0.1387, 2.7031, -1.4766, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:58:28,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.84 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06
18:58:28,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.90 | bwd_microstep: 320.48 | bwd_inner_microstep: 4.65 | bwd_allreduce_microstep: 315.73 | step_microstep: 2.85
[2025-11-06 18:58:28,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.61 | bwd: 321.24 | bwd_inner: 5.30 | bwd_allreduce: 315.78 | step: 2.94
85%|████████▍ | 2980/3507 [1:13:42<12:16, 1.40s/it] {'loss': 0.152, 'learning_rate': 1.1616738192377963e-06, 'epoch': 0.85}
tensor([[-5.0625, -5.3438, -2.1406, 2.2188, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.4688, -1.5469, 0.2695, 1.1641, -1.3203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1562, -3.1875, 1.2656, 3.7969, -1.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:28,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.31 | bwd_microstep: 6.03 | bwd_inner_microstep: 5.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-2.7656, -3.0000, -0.8945, 2.9219, -0.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-8.0625, -4.9062, 1.3672, -0.1289, -6.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.9375, -3.7812, -2.0469, 2.3281, -0.3652]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.1797, 1.1484, 1.4766, -0.3301, -1.2578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-3.2812, -3.5156, -0.5430, 3.4062, -0.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 18:58:29,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.20 | optimizer_step: 0.20
[2025-11-06 18:58:29,985] [INFO]
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.32 | bwd_microstep: 348.03 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 347.12 | step_microstep: 2.19
[2025-11-06 18:58:29,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.66 | bwd: 354.03 | bwd_inner: 6.63 | bwd_allreduce: 347.18 | step: 2.28
85%|████████▌ | 2981/3507 [1:13:43<12:55, 1.47s/it] {'loss': 0.6722, 'learning_rate': 1.1573563693943202e-06, 'epoch': 0.85}
tensor([[-3.6719, -4.1250, -2.4219, 1.1016, -1.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:30,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.78 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.0625, -2.4844, 2.4531, -0.2773, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.7812, -4.1250, -0.5391, 1.8984, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.1250, -4.8438, -0.9258, 2.6250, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3125, -0.7617, 2.5625, -0.9922, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5938, -4.7500, -1.0547, 3.0156, -1.7891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.5938, -4.2188, 1.0938, 0.9922, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.1562, -4.3750, -0.2969, 2.3438, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:58:32,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 5.38 | optimizer_gradients: 0.24 | optimizer_step: 0.21
[2025-11-06 18:58:32,471] [INFO]
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.91 | bwd_microstep: 2119.78 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 2118.72 | step_microstep: 11.87
[2025-11-06 18:58:32,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.71 | bwd: 2120.44 | bwd_inner: 1.53 | bwd_allreduce: 2118.76 | step: 11.94
85%|████████▌ | 2982/3507 [1:13:46<15:33, 1.78s/it] {'loss': 0.809, 'learning_rate': 1.1530464647129235e-06, 'epoch': 0.85}
tensor([[-5.4375, -1.4844, 0.7031, -3.7969, -5.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.3125, -4.5625, -0.3105, 4.6875, -1.1953]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.2344, 1.2969, 3.8594, -1.8828, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5312, -3.6094, 1.9453, 0.6211, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:58:32,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.68 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.4531, -3.5938, -0.4492, 3.5781, -0.8789]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.1875, -4.1875, -0.0238, 2.0625, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3125, -1.0391, 2.4844, -0.2695, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.4688, -2.5000, 3.2656, -0.2715, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:58:33,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.97 | optimizer_gradients: 0.20 | optimizer_step: 0.17
[2025-11-06 18:58:33,093] [INFO]
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 105.18 | bwd_microstep: 189.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 189.00 | step_microstep: 3.71
[2025-11-06 18:58:33,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.89 | bwd: 190.69 | bwd_inner: 1.49 | bwd_allreduce: 189.05 | step: 3.79
85%|████████▌ | 2983/3507 [1:13:46<12:29, 1.43s/it] {'loss': 0.091, 'learning_rate': 1.1487441088711194e-06, 'epoch': 0.85}
tensor([[-6.8438, -5.3125, 0.3184, 2.4219, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:33,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.13 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.4375, -1.7891, 1.4766, 1.6250, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.6875, -4.6875, 0.5234, 0.9766, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5938, -2.5156, 2.9219, 1.3203, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.7656, -2.7656, -2.0000, 1.8984, 0.3770]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.9062, -5.5625, 0.9844, 1.6875, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5625, -3.8906, 0.8164, 3.6562, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.8438, -5.8125, 0.3926, 1.4531, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:58:35,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 4.75 | optimizer_gradients: 0.19 | optimizer_step: 0.21
[2025-11-06 18:58:35,930] [INFO] [logging.py:128:log_dist] [Rank
0] time (ms) | fwd_microstep: 182.32 | bwd_microstep: 2456.37 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 2455.15 | step_microstep: 10.62
[2025-11-06 18:58:35,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.47 | bwd: 2457.13 | bwd_inner: 1.74 | bwd_allreduce: 2455.20 | step: 10.70
85%|████████▌ | 2984/3507 [1:13:49<16:10, 1.86s/it] {'loss': 0.7194, 'learning_rate': 1.1444493055399774e-06, 'epoch': 0.85}
tensor([[-5.9062, -3.3125, 1.7031, 1.1719, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3125, -1.7578, 1.2266, -2.2344, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5000, -5.0938, -2.0000, 2.8438, -1.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.4375, -4.2812, 0.4863, 0.6758, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.4375, -4.1250, 0.4043, 2.6562, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:36,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.93 | bwd_microstep: 5.58 | bwd_inner_microstep: 5.47 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.7188, -3.3594, 2.0469, 0.0383, -5.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.8125, -7.0000, -2.5469, 2.2500, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.8750, -3.9688, 0.4902, 1.0547, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:58:36,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.17 | optimizer_step: 0.28
[2025-11-06 18:58:36,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) |
fwd_microstep: 194.16 | bwd_microstep: 1.98 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.94 | step_microstep: 2.39
[2025-11-06 18:58:36,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.13 | bwd: 7.55 | bwd_inner: 6.42 | bwd_allreduce: 0.98 | step: 2.48
85%|████████▌ | 2985/3507 [1:13:50<12:29, 1.44s/it] {'loss': 0.3971, 'learning_rate': 1.1401620583841255e-06, 'epoch': 0.85}
tensor([[-2.5156, 1.5859, 2.9688, -2.7031, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:58:36,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.49 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.9688, -5.0625, -0.2236, 2.8281, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.2109, 1.0781, 3.5781, 4.2188, 0.4785]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7031, -4.5938, -1.9844, 3.1406, -0.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.4688, -4.6250, -1.4062, 2.6562, -1.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6562, -4.2500, 0.1328, 1.8594, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.8164, 2.5625, 3.3281, -0.8711, -1.7422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6562, -3.9219, -0.7852, 1.2969, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:58:43,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.18 | optimizer_step: 0.22
[2025-11-06 18:58:43,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.69 |
bwd_microstep: 6512.14 | bwd_inner_microstep: 6.68 | bwd_allreduce_microstep: 6505.34 | step_microstep: 2.43
[2025-11-06 18:58:43,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.20 | bwd: 6512.87 | bwd_inner: 7.31 | bwd_allreduce: 6505.38 | step: 2.51
85%|████████▌ | 2986/3507 [1:13:57<27:48, 3.20s/it] {'loss': 0.2721, 'learning_rate': 1.1358823710617395e-06, 'epoch': 0.85}
tensor([[-4.7812, -0.4941, 3.6094, -1.2266, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.3438, -4.3125, -0.3105, 1.9844, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6562, -1.1484, 2.6406, -2.5781, -5.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7656, -0.0444, 3.9688, 0.3164, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:58:44,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.38 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.2812, -4.8125, -1.6875, 3.0000, -1.3047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.2500, -5.8750, -2.1094, 1.1953, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.4375, -3.2031, 2.7812, 1.2656, -4.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0938, -3.8594, 0.2910, 2.1719, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:44,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.14 | optimizer_step: 0.23
[2025-11-06 18:58:44,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.30 | bwd_microstep: 2.16 |
bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.95 | step_microstep: 1.48
[2025-11-06 18:58:44,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 458.70 | bwd: 3.21 | bwd_inner: 2.10 | bwd_allreduce: 0.99 | step: 1.56
85%|████████▌ | 2987/3507 [1:13:58<20:44, 2.39s/it] {'loss': 0.0778, 'learning_rate': 1.131610247224555e-06, 'epoch': 0.85}
tensor([[-3.3281, -3.4375, 0.4785, 4.8750, -0.6445]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2500, -1.4375, 2.2500, 0.9180, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6562, -1.7656, 2.6719, 1.2578, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:58:44,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.31 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.5938, -3.6562, -0.6641, 3.1250, -1.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-6.0625, -4.5625, 0.1816, 1.8828, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.4688, -5.6875, 0.4570, 2.1094, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.3125, -5.5625, -1.7734, 2.9688, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.0938, -4.2500, 1.7734, 0.8906, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:58:44,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 18:58:44,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.70 | bwd_microstep: 32.07 | bwd_inner_microstep: 0.68 |
bwd_allreduce_microstep: 31.31 | step_microstep: 1.98
[2025-11-06 18:58:44,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.04 | bwd: 32.91 | bwd_inner: 1.42 | bwd_allreduce: 31.34 | step: 2.05
85%|████████▌ | 2988/3507 [1:13:58<15:30, 1.79s/it] {'loss': 1.0566, 'learning_rate': 1.1273456905178392e-06, 'epoch': 0.85}
tensor([[-4.3750, -1.6016, 3.0000, 1.8359, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:58:44,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 105.27 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.8125, -4.0000, -1.3594, 2.4375, -1.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0312, -5.4375, -0.7656, 2.6875, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.3750, -2.7188, 2.7656, 0.0459, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.1875, -4.5938, -0.0698, -1.0156, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-0.4922, 2.4062, 3.1562, -0.1021, -1.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.3125, -0.6211, 3.3906, -2.2188, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.7500, -0.7188, 3.3594, -1.1406, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:58:45,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.81 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 18:58:45,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 284.01 | bwd_microstep: 179.35 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep:
178.48 | step_microstep: 2.23 [2025-11-06 18:58:45,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.29 | bwd: 180.06 | bwd_inner: 1.40 | bwd_allreduce: 178.52 | step: 2.30 85%|████████▌ | 2989/3507 [1:13:59<12:24, 1.44s/it] {'loss': 0.4217, 'learning_rate': 1.1230887045804151e-06, 'epoch': 0.85} 85%|████████▌ | 2989/3507 [1:13:59<12:24, 1.44s/it]tensor([[-4.2188, -1.5391, 1.6875, 0.1934, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1562, -5.2500, -2.2969, 1.7422, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 18:58:45,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.35 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.1875, -0.8086, 2.7031, -0.1226, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9375, 0.3477, 2.5312, -0.5195, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.3750, -4.1250, -1.4688, 1.3359, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.6250, -1.4609, 3.0625, -1.3203, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2812, 0.4648, 3.7344, -2.5312, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8438, -4.2812, -0.8164, 1.8594, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:58:47,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.73 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:58:47,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.61 | bwd_microstep: 1976.36 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1975.23 | step_microstep: 
2.19 [2025-11-06 18:58:47,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.99 | bwd: 1977.16 | bwd_inner: 1.76 | bwd_allreduce: 1975.26 | step: 2.26 85%|████████▌ | 2990/3507 [1:14:01<14:53, 1.73s/it] {'loss': 0.8682, 'learning_rate': 1.1188392930446368e-06, 'epoch': 0.85} 85%|████████▌ | 2990/3507 [1:14:01<14:53, 1.73s/it]tensor([[-4.9688, -1.3516, 2.6250, -0.7461, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:58:47,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 81.68 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[2.8750, 2.6875, 4.9062, 8.3750, 3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5625, -4.3438, -0.7148, 3.0000, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5938, -3.5156, -0.2471, 3.3438, -1.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.4375, -5.1562, -0.8125, 1.2812, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -4.6875, -0.7070, 3.1250, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -4.4375, -0.2393, 3.2969, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.1406, 2.7188, 3.2500, -1.6328, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 18:58:48,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.29 | optimizer_step: 0.29 [2025-11-06 18:58:48,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.50 | bwd_microstep: 420.35 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 419.09 | step_microstep: 3.05 [2025-11-06 
18:58:48,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 257.20 | bwd: 421.25 | bwd_inner: 1.95 | bwd_allreduce: 419.14 | step: 3.14 85%|████████▌ | 2991/3507 [1:14:02<12:14, 1.42s/it] {'loss': 0.1709, 'learning_rate': 1.114597459536404e-06, 'epoch': 0.85} 85%|████████▌ | 2991/3507 [1:14:02<12:14, 1.42s/it]tensor([[-2.7969, 1.5625, 3.2969, -2.3750, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0000, -0.3008, 2.2188, 0.0618, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0938, 0.7656, 3.3125, -1.2109, -3.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.9062, -2.6719, 0.8125, 2.1094, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:58:48,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.40 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.7188e+00, -3.7188e+00, 4.6997e-03, 2.0000e+00, -2.6094e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9844, -3.7031, 0.2109, 3.7031, -1.4141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0625, -2.0781, 1.6641, -0.0270, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.3438, -1.7422, 3.1562, -2.1250, -6.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:58:51,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.22 | optimizer_step: 0.18 [2025-11-06 18:58:51,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.79 | bwd_microstep: 2481.09 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 2479.67 | step_microstep: 2.79 [2025-11-06 
18:58:51,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 370.22 | bwd: 2482.14 | bwd_inner: 2.28 | bwd_allreduce: 2479.72 | step: 2.87 85%|████████▌ | 2992/3507 [1:14:05<16:35, 1.93s/it] {'loss': 0.4549, 'learning_rate': 1.1103632076751459e-06, 'epoch': 0.85} 85%|████████▌ | 2992/3507 [1:14:05<16:35, 1.93s/it]tensor([[-3.7969, 0.5156, 2.9531, -2.2969, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:58:51,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.88 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.3438, -3.6875, 2.1094, 1.8906, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8438, -1.9531, 1.4375, 1.3359, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -1.2656, 2.4688, -1.4844, -4.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.9688, -3.0469, 0.1465, 4.0625, -0.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5000, -3.3750, 0.0684, 1.7891, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.8438, -0.6992, 1.2500, 4.0938, 0.7305]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.9062, -3.6875, 2.0156, -0.0908, -5.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:58:52,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 18:58:52,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.38 | bwd_microstep: 2.13 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 0.72 | step_microstep: 1.60 [2025-11-06 18:58:52,008] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 491.29 | bwd: 2.83 | bwd_inner: 1.94 | bwd_allreduce: 0.76 | step: 1.69 85%|████████▌ | 2993/3507 [1:14:05<12:58, 1.51s/it] {'loss': 0.2418, 'learning_rate': 1.1061365410738168e-06, 'epoch': 0.85} 85%|████████▌ | 2993/3507 [1:14:05<12:58, 1.51s/it]tensor([[-4.3750, -2.3594, 0.5859, 0.3066, -3.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.9688, -5.4375, -0.0835, 1.5938, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9688, -2.8906, 2.7031, 1.2188, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:58:52,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.85 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.12 tensor([[-3.9375, -3.3906, 0.4297, 3.2188, -1.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.3125, -6.3125, -1.2422, 1.6953, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.6250, -2.3906, 2.4531, -1.9922, -6.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6562, -3.5781, -2.1250, 2.2500, -0.2002]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9062, -2.1875, 1.3672, -0.1475, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:58:54,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 18:58:54,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.36 | bwd_microstep: 1730.62 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1729.56 | step_microstep: 1.74 [2025-11-06 18:58:54,208] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 429.24 | bwd: 1731.49 | bwd_inner: 1.73 | bwd_allreduce: 1729.61 | step: 1.88 85%|████████▌ | 2994/3507 [1:14:08<14:42, 1.72s/it] {'loss': 0.5846, 'learning_rate': 1.1019174633389073e-06, 'epoch': 0.85} 85%|████████▌ | 2994/3507 [1:14:08<14:42, 1.72s/it]tensor([[-6.5000, -3.4375, 1.4609, 0.1738, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:58:54,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.38 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.7500, -4.2812, -0.2773, 3.2656, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3438, -4.5625, -0.4023, 2.4688, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5625, -3.8906, 1.7031, 1.4531, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5312, -3.0938, 1.0625, 0.3965, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0938, -2.7031, 1.1562, 0.4824, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4062, -4.2188, -1.9219, 2.8750, -0.6328]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-5.1250, -3.2656, 1.0625, 1.6406, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:58:54,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.29 [2025-11-06 18:58:54,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.22 | bwd_microstep: 146.95 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 145.77 | step_microstep: 2.29 [2025-11-06 18:58:54,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
342.63 | bwd: 147.96 | bwd_inner: 2.02 | bwd_allreduce: 145.81 | step: 2.36 85%|████████▌ | 2995/3507 [1:14:08<11:37, 1.36s/it] {'loss': 0.9675, 'learning_rate': 1.0977059780704314e-06, 'epoch': 0.85} 85%|████████▌ | 2995/3507 [1:14:08<11:37, 1.36s/it]tensor([[-5.7500, -3.5469, 0.5195, 0.4746, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.3125, -5.4375, 0.6328, 2.0625, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6562, -2.1094, 1.5938, 4.9062, -0.4883]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:58:54,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.71 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-4.7188, -0.4434, 3.6875, -1.1094, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.7188, -5.8438, -1.0938, 1.9375, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-8.6250, -7.7812, -2.2656, 1.5938, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9062, -3.4219, 0.7422, 2.0156, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5312, -3.0938, 1.2734, 2.8594, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:58:57,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 18:58:57,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 134.19 | bwd_microstep: 2862.92 | bwd_inner_microstep: 5.18 | bwd_allreduce_microstep: 2857.64 | step_microstep: 2.00 [2025-11-06 18:58:57,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.91 | bwd: 2863.92 | 
bwd_inner: 6.06 | bwd_allreduce: 2857.69 | step: 2.10 85%|████████▌ | 2996/3507 [1:14:11<16:24, 1.93s/it] {'loss': 0.3654, 'learning_rate': 1.0935020888619218e-06, 'epoch': 0.85} 85%|████████▌ | 2996/3507 [1:14:11<16:24, 1.93s/it]tensor([[-7.5625, -6.3125, -0.3535, 2.5000, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.2891, 1.8359, 3.3438, 0.3145, -1.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 18:58:58,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.41 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.0000, -0.1113, 1.9141, -2.4219, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.1250, -4.8750, 1.0234, 1.7812, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.1406, -1.9141, 2.1094, 4.0000, -1.3672]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6250, -1.4375, 1.7734, 1.1641, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -4.1562, -0.2988, 3.9844, -1.3203]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.8438, 2.4219, 3.1562, -2.7188, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 18:58:58,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 18:58:58,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.25 | bwd_microstep: 513.47 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 512.63 | step_microstep: 1.77 [2025-11-06 18:58:58,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 281.67 | bwd: 514.36 | bwd_inner: 1.57 | 
bwd_allreduce: 512.66 | step: 1.84 85%|████████▌ | 2997/3507 [1:14:12<13:34, 1.60s/it] {'loss': 0.5139, 'learning_rate': 1.0893057993004297e-06, 'epoch': 0.85} 85%|████████▌ | 2997/3507 [1:14:12<13:34, 1.60s/it]tensor([[-3.2969, -1.9609, 0.7383, 1.4844, -1.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0312, -3.0938, 1.9766, 0.7539, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0938, -5.1250, -0.4688, 1.9688, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:58:59,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.02 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.8125, -4.6875, -0.6953, 3.2656, -1.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4375, -1.8516, 1.6797, 0.3926, -3.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7812, -4.4375, -2.0938, 2.5469, -0.9805]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1875, -4.5938, -0.2891, 3.0781, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9844, -3.5938, -2.3594, 1.1328, -0.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') [2025-11-06 18:59:00,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:59:00,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.66 | bwd_microstep: 1130.87 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1129.76 | step_microstep: 1.63 [2025-11-06 18:59:00,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.70 | bwd: 1131.78 | bwd_inner: 1.86 | bwd_allreduce: 1129.80 
| step: 1.71 85%|████████▌ | 2998/3507 [1:14:14<13:27, 1.59s/it] {'loss': 0.7085, 'learning_rate': 1.085117112966525e-06, 'epoch': 0.85} 85%|████████▌ | 2998/3507 [1:14:14<13:27, 1.59s/it]tensor([[-6.7500, -2.2031, 3.2188, -1.5312, -6.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0000, -4.4688, -0.1406, 1.2812, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:59:00,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.75 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-7.1250, -4.3125, 0.8555, -0.3789, -5.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.9062, -5.9062, -1.0000, -0.1963, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1406, 2.2500, 3.0938, -3.1562, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.9062, -3.5000, 0.1406, 3.2656, -1.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.8438, -3.4219, 1.6562, 1.4375, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5312, -1.5000, 1.6797, 1.6641, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:59:00,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:59:00,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.60 | bwd_microstep: 67.99 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 66.94 | step_microstep: 1.33 [2025-11-06 18:59:00,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.38 | bwd: 68.93 | bwd_inner: 1.85 | bwd_allreduce: 66.97 | step: 1.41 86%|████████▌ | 
2999/3507 [1:14:14<10:28, 1.24s/it] {'loss': 0.6113, 'learning_rate': 1.0809360334342855e-06, 'epoch': 0.86} 86%|████████▌ | 2999/3507 [1:14:14<10:28, 1.24s/it]tensor([[-5.0938, -4.0312, 0.8164, 3.2031, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:59:00,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.21 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.7188, -2.9062, 1.4141, 2.3906, -2.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0625, -2.7656, 0.9453, 0.1641, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5625, -0.4844, 3.7969, -0.7266, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6875, -2.4844, 0.4746, 3.7969, -0.5664]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5000, -4.3125, 0.2383, 2.3594, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.4219, 1.6719, 4.0625, -0.9141, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9062, -3.4062, 0.1133, 3.0938, -1.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:59:03,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.24 | optimizer_step: 0.25 [2025-11-06 18:59:03,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 76.98 | bwd_microstep: 2283.04 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 2282.00 | step_microstep: 2.44 [2025-11-06 18:59:03,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 216.22 | bwd: 2283.98 | bwd_inner: 1.78 | bwd_allreduce: 2282.05 | step: 2.52 86%|████████▌ | 3000/3507 [1:14:17<13:43, 
1.62s/it] {'loss': 0.4454, 'learning_rate': 1.076762564271302e-06, 'epoch': 0.86} 86%|████████▌ | 3000/3507 [1:14:17<13:43, 1.62s/it]tensor([[-6.1250, -3.6719, 1.4375, 1.2969, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0938, -2.7031, 2.3125, -0.2871, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4688, -2.8438, 1.3516, 2.4219, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5312, -5.5625, -1.3281, 1.0547, -3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:59:03,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.97 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.6406, -2.1250, 0.7500, 1.3438, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-9.1250, -6.6562, -0.4844, 0.0962, -6.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4062, -5.1562, -0.9570, 2.9688, -2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.0312, -5.0625, -0.4336, 2.1406, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:59:03,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.22 | optimizer_step: 0.21 [2025-11-06 18:59:03,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.92 | bwd_microstep: 7.11 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 5.76 | step_microstep: 1.99 [2025-11-06 18:59:03,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.92 | bwd: 8.18 | bwd_inner: 2.18 | bwd_allreduce: 5.82 | step: 2.09 86%|████████▌ | 3001/3507 [1:14:17<10:39, 1.26s/it] {'loss': 0.3418, 
'learning_rate': 1.0725967090386702e-06, 'epoch': 0.86} 86%|████████▌ | 3001/3507 [1:14:17<10:39, 1.26s/it]tensor([[-6.5625, -4.6250, 0.7773, 1.5000, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.5938, -0.0830, 3.0781, 1.4766, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7344, -3.7656, -0.4785, 3.3281, -1.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7500, -3.8281, 1.0469, 1.8594, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -4.1562, -0.0598, 2.1406, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:59:04,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.68 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11 tensor([[-4.5938, -4.8125, -1.2344, 3.2656, -1.6328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7500, -4.0625, 0.0168, 2.7344, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.9375, -2.0312, 3.3594, -0.1128, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:59:05,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 18:59:05,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.15 | bwd_microstep: 818.20 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 817.02 | step_microstep: 1.93 [2025-11-06 18:59:05,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.85 | bwd: 819.30 | bwd_inner: 2.10 | bwd_allreduce: 817.07 | step: 2.04 86%|████████▌ | 3002/3507 [1:14:19<11:17, 1.34s/it] {'loss': 0.4498, 'learning_rate': 
1.068438471290988e-06, 'epoch': 0.86} 86%|████████▌ | 3002/3507 [1:14:19<11:17, 1.34s/it]tensor([[-4.0625, -2.4375, 1.4922, 1.9844, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6250, -4.6875, -0.7344, -0.3379, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2188, -3.7500, 0.7109, 2.4531, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8125, -0.8398, 4.1250, -1.8516, -6.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:59:05,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.49 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-6.2188, -2.2344, 3.0781, -0.2793, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9062, -5.0000, 0.0659, 3.0312, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4062, -4.2500, 1.1797, 1.7656, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0938, -0.6875, 3.4844, 0.7031, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:59:05,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 18:59:05,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.38 | bwd_microstep: 1.96 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.56 [2025-11-06 18:59:05,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.90 | bwd: 2.92 | bwd_inner: 1.89 | bwd_allreduce: 0.90 | step: 1.65 86%|████████▌ | 3003/3507 [1:14:19<09:06, 1.08s/it] {'loss': 0.3492, 'learning_rate': 1.064287854576359e-06, 'epoch': 
0.86} 86%|████████▌ | 3003/3507 [1:14:19<09:06, 1.08s/it]tensor([[-2.8281, 1.3750, 3.1875, -1.9844, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:59:06,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.99 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.5625, -2.1562, 2.6562, 0.4062, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7500, 2.7500, 3.5312, -2.5625, -3.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.8438, -4.5938, 1.2266, -0.7227, -6.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.6875, -4.5000, 1.4062, 2.0469, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5781, -3.5938, -2.5156, 1.9688, -0.0593]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -2.6250, 1.9062, 2.1406, -3.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6250, -0.0703, 3.7188, -1.6953, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:59:07,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 18:59:07,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.05 | bwd_microstep: 1204.40 | bwd_inner_microstep: 7.97 | bwd_allreduce_microstep: 1196.32 | step_microstep: 1.77 [2025-11-06 18:59:07,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.05 | bwd: 1205.19 | bwd_inner: 8.70 | bwd_allreduce: 1196.35 | step: 1.84 86%|████████▌ | 3004/3507 [1:14:21<10:44, 1.28s/it] {'loss': 0.5284, 'learning_rate': 1.0601448624363752e-06, 'epoch': 0.86} 86%|████████▌ | 
3004/3507 [1:14:21<10:44, 1.28s/it]
tensor([[-7.9375, -5.8125, 0.4609, 1.5469, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0625, -2.7969, 1.2734, 0.7500, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.7188, -5.3750, 0.6875, 1.3125, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.4688, -4.1562, 0.2373, 2.2656, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.1562, -2.7031, 1.7891, 3.5625, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5000, -2.0938, 1.5703, -1.2734, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.9062, -5.3750, -0.4004, 3.5312, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:08,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.77 | bwd_microstep: 3.57 | bwd_inner_microstep: 3.45 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.1250, -1.4609, 3.7188, 0.8945, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:09,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 7.27 | optimizer_gradients: 0.21 | optimizer_step: 0.18
[2025-11-06 18:59:09,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.46 | bwd_microstep: 1.65 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.80 | step_microstep: 19.67
[2025-11-06 18:59:09,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 472.27 | bwd: 5.22 | bwd_inner: 4.25 | bwd_allreduce: 0.83 | step: 19.76
86%|████████▌ | 3005/3507 [1:14:23<11:55, 1.43s/it] {'loss': 0.4955, 'learning_rate': 1.0560094984061276e-06, 'epoch': 0.86}
tensor([[-2.9844, 0.6758, 3.1094, -1.2109, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:09,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.28 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.5938, -2.7031, 1.2578, 1.8984, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[3.6875, 6.7812, 5.5312, 1.5781, 1.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-1.8594, 2.0781, 3.2500, -2.1406, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.6875, -5.2188, -1.1562, 2.2188, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9375, -2.6250, 1.7734, 1.6250, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8438, -3.9062, 0.4043, 3.0156, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.0938, -5.8750, -1.0156, 0.8711, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:59:10,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 5.28 | optimizer_gradients: 0.24 | optimizer_step: 0.20
[2025-11-06 18:59:10,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 68.41 | bwd_microstep: 1299.96 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 1298.68 | step_microstep: 7.20
[2025-11-06 18:59:10,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 233.71 | bwd: 1300.87 | bwd_inner: 1.96 | bwd_allreduce: 1298.73 | step: 7.28
86%|████████▌ | 3006/3507 [1:14:24<12:18, 1.47s/it] {'loss': 0.3498, 'learning_rate': 1.0518817660141977e-06, 'epoch': 0.86}
tensor([[-3.8125, -2.7188, 1.3203, 3.1406, -1.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4688, -5.0625, -0.5039, 3.2812, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.4688, -4.0000, -0.8633, -1.7891, -5.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.6562, -4.7188, 0.4922, -0.5898, -5.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5000, -0.2852, 2.0312, -2.8438, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0000, -4.3438, -1.5312, 2.5469, -1.3672]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.1562, -1.8359, 1.2734, 0.6758, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:12,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.24 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.0000, -4.6875, -0.8125, 2.9531, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:12,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 18:59:12,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.87 | bwd_microstep: 1.65 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.81 | step_microstep: 3.08
[2025-11-06 18:59:12,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.12 | bwd: 2.41 | bwd_inner: 1.42 | bwd_allreduce: 0.85 | step: 3.16
86%|████████▌ | 3007/3507 [1:14:26<13:23, 1.61s/it] {'loss': 0.4906, 'learning_rate': 1.0477616687826597e-06, 'epoch': 0.86}
tensor([[-4.6250, -1.9844, 1.8984, 0.8086, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:12,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.12 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.7812, 2.2188, 1.5234, -1.8438, -1.5547]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-4.0938, -3.5938, -0.0117, 2.8594, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.9062, -2.2500, 2.7500, -0.5312, -5.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.3125, -3.6875, 1.3828, 0.8125, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.8750, 1.5000, 3.3750, -2.2656, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
tensor([[-4.5000, -3.9531, -0.9688, 1.7656, -2.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.0312, -5.1250, 1.2578, 2.7656, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 18:59:14,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 18:59:14,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.49 | bwd_microstep: 1001.47 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 1000.64 | step_microstep: 1.86
[2025-11-06 18:59:14,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 284.64 | bwd: 1002.15 | bwd_inner: 1.32 | bwd_allreduce: 1000.68 | step: 1.95
86%|████████▌ | 3008/3507 [1:14:27<12:40, 1.52s/it] {'loss': 0.4547, 'learning_rate': 1.0436492102270646e-06, 'epoch': 0.86}
tensor([[-3.9531, -3.8438, -0.9258, 2.4219, -1.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5000, -0.6445, 2.1406, -1.9375, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.8984, 2.5625, 3.5469, -2.8594, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2500, -0.2871, 3.1250, -1.2344, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.7812, -5.2188, -1.6875, 3.3594, -1.5547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.0312, -3.9531, 1.7266, 0.4395, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.7500, -5.0625, -1.4375, 1.6094, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:17,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.50 | bwd_microstep: 1.52 | bwd_inner_microstep: 1.41 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.2656, 2.5938, 3.5000, -1.4844, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:59:17,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.13 | optimizer_step: 0.15
[2025-11-06 18:59:17,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.00 | bwd_microstep: 1.95 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.83 | step_microstep: 1.94
[2025-11-06 18:59:17,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.49 | bwd: 3.46 | bwd_inner: 2.48 | bwd_allreduce: 0.86 | step: 2.03
86%|████████▌ | 3009/3507 [1:14:31<17:05, 2.06s/it] {'loss': 0.2597, 'learning_rate': 1.0395443938564542e-06, 'epoch': 0.86}
tensor([[-2.6406, 1.7031, 4.3125, -1.1328, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.6719, -4.0000, -0.4980, 4.0625, -0.8867]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.3750, -6.7812, -2.0156, 1.6484, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:17,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.91 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.6250, -3.1094, 1.8828, 1.2500, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.4375, -2.6250, 2.9531, -0.4414, -5.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.4688, -2.7969, 2.3125, -0.6406, -5.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8438, -4.7812, -0.9062, 2.9844, -2.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.9375, -4.8438, 1.1719, 2.0625, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:59:17,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 18:59:17,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.71 | bwd_microstep: 112.99 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 111.99 | step_microstep: 1.81
[2025-11-06 18:59:17,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 304.64 | bwd: 113.79 | bwd_inner: 1.65 | bwd_allreduce: 112.03 | step: 1.89
86%|████████▌ | 3010/3507 [1:14:31<13:03, 1.58s/it] {'loss': 0.3177, 'learning_rate': 1.035447223173337e-06, 'epoch': 0.86}
tensor([[-0.9414, 2.7031, 2.5781, -2.7188, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9844, -0.4746, 3.7812, 0.8438, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-1.8750, 1.6484, 2.5781, -1.7656, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.7812, -2.4062, -1.8281, 1.4453, 0.2051]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3125, -3.8125, -0.0874, 3.0625, -1.8359]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.0000, -3.3906, 1.9141, 1.6406, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.5781, 0.0154, 3.0000, -1.1250, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 18:59:19,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.47 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.2188, -3.6250, -1.5234, 2.2500, -0.7734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 18:59:19,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.81 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 18:59:19,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.67 | bwd_microstep: 1.84 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.74
[2025-11-06 18:59:19,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.17 | bwd: 2.91 | bwd_inner: 1.90 | bwd_allreduce: 0.89 | step: 2.82
86%|████████▌ | 3011/3507 [1:14:33<14:17, 1.73s/it] {'loss': 1.0358, 'learning_rate': 1.031357701673713e-06, 'epoch': 0.86}
tensor([[-0.8672, 1.7188, 1.8125, -1.0469, -1.4297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.9375, 0.8789, 3.7500, -2.8281, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.7812, -4.3438, 0.2559, 1.9609, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4688, -1.9375, 1.8359, -0.8047, -4.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0312, -3.7969, 0.2412, 2.3438, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:20,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.93 | bwd_microstep: 5.52 | bwd_inner_microstep: 5.36 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11
tensor([[-4.5938, -3.9062, -0.4551, 2.1250, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4688, -2.9375, 0.6250, 1.6562, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.7812, -4.8125, 0.4043, 1.3438, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:20,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.18 | optimizer_step: 0.20
[2025-11-06 18:59:20,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.96 | bwd_microstep: 1.89 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.23
[2025-11-06 18:59:20,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.90 | bwd: 7.40 | bwd_inner: 6.32 | bwd_allreduce: 0.90 | step: 2.34
86%|████████▌ | 3012/3507 [1:14:34<11:02, 1.34s/it] {'loss': 0.3802, 'learning_rate': 1.0272758328470445e-06, 'epoch': 0.86}
tensor([[-4.2500, -3.7031, -0.1709, 2.3438, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6719, 0.1045, 2.5156, -2.0156, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.1875, 1.4531, 2.5000, -1.4844, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-1.0312, 2.8281, 3.0781, -2.3125, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.2500, -4.0000, 0.1475, 2.0312, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5000, 1.0781, 4.5312, -1.2734, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.7812, -0.2041, 3.7656, -2.0781, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:22,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.97 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.1094, 0.2715, 3.4844, 0.3984, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:22,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.75 | optimizer_gradients: 0.15 | optimizer_step: 0.26
[2025-11-06 18:59:22,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.97 | bwd_microstep: 2.33 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1.04 | step_microstep: 3.00
[2025-11-06 18:59:22,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 492.96 | bwd: 3.21 | bwd_inner: 1.97 | bwd_allreduce: 1.08 | step: 3.09
86%|████████▌ | 3013/3507 [1:14:36<13:54, 1.69s/it] {'loss': 0.3371, 'learning_rate': 1.0232016201762696e-06, 'epoch': 0.86}
tensor([[-6.8438, -4.5312, 1.1719, 1.6016, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.7812, -1.3750, 2.0469, -0.8086, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.0938, -2.3750, 2.2188, 1.1797, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:59:23,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.70 | bwd_microstep: 1.12 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-7.0312, -5.1562, 0.6680, 1.8828, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.1562, -4.6875, -1.3359, 1.4922, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.0312, -3.2031, 0.0317, 2.3594, -1.9453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.3750, -5.7812, -0.8945, 2.7812, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-3.9219, 0.3223, 3.5625, -1.5547, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 18:59:23,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:59:23,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.82 | bwd_microstep: 42.50 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 41.36 | step_microstep: 1.67
[2025-11-06 18:59:23,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.56 | bwd: 43.60 | bwd_inner: 1.99 | bwd_allreduce: 41.41 | step: 1.76
86%|████████▌ | 3014/3507 [1:14:37<10:45, 1.31s/it] {'loss': 0.3837, 'learning_rate': 1.0191350671377898e-06, 'epoch': 0.86}
tensor([[-7.0938, -5.9375, -0.8438, 1.7344, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1250, -2.6875, 1.1172, 0.2363, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5312, -4.8750, -0.9141, 2.1406, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5312, -1.8906, 1.1406, -0.3633, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-8.5000, -6.7500, -1.2734, 0.4219, -5.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.3125, -4.9375, 0.2676, 2.3906, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.1562, -3.6094, 0.1494, -3.0469, -6.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:24,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.27 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06
tensor([[-3.8281, -0.7344, 2.5000, -0.1523, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:24,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 18:59:24,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.85 | bwd_microstep: 1.78 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.06
[2025-11-06 18:59:24,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 415.09 | bwd: 2.64 | bwd_inner: 1.62 | bwd_allreduce: 0.88 | step: 2.13
86%|████████▌ | 3015/3507 [1:14:38<10:12, 1.25s/it] {'loss': 0.2609, 'learning_rate': 1.0150761772014739e-06, 'epoch': 0.86}
tensor([[-6.5000e+00, -3.4688e+00, 1.5625e+00, -4.5013e-04, -5.1250e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-7.1875, -4.4375, 1.1328, 0.4629, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:24,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 140.82 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.6562, -0.5898, 3.1562, -1.7188, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9219, -2.8594, 1.0781, 3.0469, -1.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.7188, -4.0312, 1.0703, 2.4375, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.7031, 0.7188, 2.4688, -1.6094, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.6875, 0.7656, 3.1562, -0.0084, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.4844, -2.3906, 0.1196, 1.5938, -1.8672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:59:25,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.18 | optimizer_step: 0.23
[2025-11-06 18:59:25,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.17 | bwd_microstep: 96.96 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 95.74 | step_microstep: 1.89
[2025-11-06 18:59:25,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.99 | bwd: 97.90 | bwd_inner: 1.96 | bwd_allreduce: 95.79 | step: 1.98
86%|████████▌ | 3016/3507 [1:14:38<09:05, 1.11s/it] {'loss': 1.0096, 'learning_rate': 1.0110249538306493e-06, 'epoch': 0.86}
tensor([[-7.4688, -4.3125, 2.2188, 1.0469, -5.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.6250, 0.7930, 1.6719, -2.2656, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0000, -2.1719, 2.7344, 3.8281, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3438, -2.7188, 0.8242, 1.5781, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.0000, -5.0312, -0.3281, 0.4707, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8906, -3.2812, -0.9805, 1.1328, -1.9297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.0312, 2.0469, 3.4375, -1.7266, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:27,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.57 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-6.0625, -3.1562, 2.4531, 1.4375, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:27,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.73 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:59:27,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.11 | bwd_microstep: 1.64 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.78 | step_microstep: 2.56
[2025-11-06 18:59:27,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 474.69 | bwd: 2.67 | bwd_inner: 1.69 | bwd_allreduce: 0.83 | step: 2.66
86%|████████▌ | 3017/3507 [1:14:41<13:15, 1.62s/it] {'loss': 0.5259, 'learning_rate': 1.0069814004821033e-06, 'epoch': 0.86}
tensor([[-2.5469, -3.3281, -2.4375, 1.0938, -0.3320]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-7.0000, -3.5312, 2.6719, 0.7070, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:28,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.66 | bwd_microstep: 0.63 | bwd_inner_microstep: 0.53 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-0.6484, 2.8750, 2.5938, -2.3125, -1.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-3.5156, -3.8594, -1.0703, 3.2969, -0.8164]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6562, -2.6250, 2.5938, 1.0156, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.1094, 2.9531, 4.6875, -1.3281, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.7812, -4.6250, -0.2305, 2.3125, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2500, -0.0244, 1.9766, -1.2031, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 18:59:29,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 18:59:29,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.60 | bwd_microstep: 869.73 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 868.66 | step_microstep: 1.99
[2025-11-06 18:59:29,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 360.29 | bwd: 870.35 | bwd_inner: 1.52 | bwd_allreduce: 868.69 | step: 2.06
86%|████████▌ | 3018/3507 [1:14:43<12:21, 1.52s/it] {'loss': 0.3823, 'learning_rate': 1.0029455206060778e-06, 'epoch': 0.86}
tensor([[-6.1250, -4.4375, 1.0078, 2.5781, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5938, -4.3125, 1.0156, 3.3594, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0625, -3.6875, 0.6836, 2.5000, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.5625, -1.8828, 0.3516, 4.1250, 0.5234]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-1.1250, 2.7500, 3.0312, -2.6875, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-6.4688, -3.7500, 1.9297, 1.5000, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4688, -4.1250, 0.1553, 1.8203, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:30,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.62 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.0000, -4.0000, 0.0996, 2.4219, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:31,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 18:59:31,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.36 | bwd_microstep: 1.93 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.16
[2025-11-06 18:59:31,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 373.95 | bwd: 2.77 | bwd_inner: 1.78 | bwd_allreduce: 0.87 | step: 2.23
86%|████████▌ | 3019/3507 [1:14:44<12:52, 1.58s/it] {'loss': 0.2672, 'learning_rate': 9.989173176462708e-07, 'epoch': 0.86}
tensor([[0.0757, 0.8594, 3.2812, 4.8750, 1.0391]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0312, -3.0312, 0.7930, 0.5820, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:31,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.60 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.4062, -4.2500, 0.0330, 2.2344, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.9375, -4.4062, -0.8516, 1.9453, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9062, -4.5312, -1.0156, 2.2969, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.4062e+00, -3.2656e+00, 1.5391e+00, 3.2654e-03, -5.0625e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.3125, -6.7188, -1.8438, 1.8281, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8750, -4.2500, 0.2715, 3.5312, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:31,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 18:59:31,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.19 | bwd_microstep: 1.55 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.73 | step_microstep: 1.62
[2025-11-06 18:59:31,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 312.80 | bwd: 2.21 | bwd_inner: 1.32 | bwd_allreduce: 0.77 | step: 1.70
86%|████████▌ | 3020/3507 [1:14:45<10:10, 1.25s/it] {'loss': 0.6352, 'learning_rate': 9.94896795039827e-07, 'epoch': 0.86}
tensor([[-6.5000, -5.9375, -1.2266, 2.2969, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.2500, -2.8281, 1.1719, 0.5664, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.2188, 0.6055, 3.0312, -0.8398, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.0625, -3.7031, 0.9258, 0.8516, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.8125, -4.8750, 1.0000, 2.2031, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5000, -1.7812, 3.6250, 0.5117, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0938, -5.5625, -1.3906, 1.9609, -3.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:34,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.57 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.2500, -2.6250, 1.8125, 0.7109, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:34,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.59 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 18:59:34,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 328.81 | bwd_microstep: 1.70 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.82 | step_microstep: 3.05
[2025-11-06 18:59:34,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 513.41 | bwd: 2.57 | bwd_inner: 1.58 | bwd_allreduce: 0.85 | step: 3.13
86%|████████▌ | 3021/3507 [1:14:48<15:14, 1.88s/it] {'loss': 0.3759, 'learning_rate': 9.908839562173344e-07, 'epoch': 0.86}
tensor([[-4.3438, -4.8125, -1.6953, 2.9688, -1.3828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.7812, -4.3438, -0.0728, 3.3281, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.5625, -4.8125, -0.4688, 2.3750, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:35,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.11 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.0625, -4.7188, -0.2715, 3.4688, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.1250, -5.4688, 0.0420, 2.0312, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5938, -4.5000, -1.0391, 2.7812, -1.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.1875, 1.6797, 3.3750, -1.6250, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.1250, -5.4375, -1.1953, 1.8516, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:35,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.81 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 18:59:35,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.65 | bwd_microstep: 1.93 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.66
[2025-11-06 18:59:35,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 364.79 | bwd: 2.76 | bwd_inner: 1.76 | bwd_allreduce: 0.85 | step: 2.74
86%|████████▌ | 3022/3507 [1:14:49<11:37, 1.44s/it] {'loss': 0.4234, 'learning_rate': 9.868788046028266e-07, 'epoch': 0.86}
tensor([[-3.1406, 0.9414, 3.4062, -1.5781, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.0625, -4.3750, 0.3867, 1.4922, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5625, -0.6562, 3.2188, -1.0703, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.9688, -3.8750, 0.1172, 2.4375, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0000, -4.7500, -1.0859, 2.4375, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.4688, -5.1250, 0.3789, 2.5938, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3125, -0.0928, 2.6250, -2.1094, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:37,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.53 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.7188, -0.6758, 2.2344, -1.8750, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:37,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 18:59:37,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.44 | bwd_microstep: 2.06 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.91 | step_microstep: 2.13
[2025-11-06 18:59:37,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.00 | bwd: 3.03 | bwd_inner: 1.95 | bwd_allreduce: 0.94 | step: 2.21
86%|████████▌ | 3023/3507 [1:14:51<13:03, 1.62s/it] {'loss': 0.1006, 'learning_rate': 9.828813436137829e-07, 'epoch': 0.86}
tensor([[-3.1875, -1.3125, 1.2031, 0.7734, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.7031, 0.5117, 3.5000, -1.4844, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 18:59:37,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.87 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-4.0312, -2.7344, 0.6406, 1.8828, -2.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5000, -2.8750, 1.4375, 2.0312, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0625, -4.7500, -1.0859, 2.1406, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2188, -5.5625, -1.9766, 2.7500, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.5938, -5.0312, -1.9688, 2.6250, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4062, 0.5195, 3.9531, -2.5000, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 18:59:37,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.24 | optimizer_step: 0.30
[2025-11-06 18:59:37,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.97 | bwd_microstep: 51.66 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 50.42 | step_microstep: 2.51
[2025-11-06 18:59:37,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.87 | bwd: 52.45 | bwd_inner: 1.85 | bwd_allreduce: 50.46 | step: 2.58
86%|████████▌ | 3024/3507 [1:14:51<10:08, 1.26s/it] {'loss': 0.1852, 'learning_rate': 9.788915766611151e-07, 'epoch': 0.86}
tensor([[-6.3438, -5.8438, -1.2031, 1.9922, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 18:59:37,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.88 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-7.5312, -6.7812, -1.9531, 1.3203, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.4375, 1.2969,
4.1562, 0.1641, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.6875, -1.6172, 3.5156, -0.3145, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7969, 0.3203, 3.5312, -1.0781, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7344, 0.0369, 3.9062, 0.0449, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.5078, 1.8594, 1.1953, -3.0312, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.0938, -3.9531, -0.0767, 3.8750, -1.4297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:59:38,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:59:38,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.19 | bwd_microstep: 218.53 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 217.63 | step_microstep: 1.82 [2025-11-06 18:59:38,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 247.07 | bwd: 219.51 | bwd_inner: 1.68 | bwd_allreduce: 217.68 | step: 1.90 86%|████████▋ | 3025/3507 [1:14:52<08:17, 1.03s/it] {'loss': 0.0946, 'learning_rate': 9.749095071491744e-07, 'epoch': 0.86} 86%|████████▋ | 3025/3507 [1:14:52<08:17, 1.03s/it]tensor([[-5.5625, -1.7969, 2.0625, -1.4141, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0312, -1.9766, 3.0938, -0.9219, -5.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1562, -3.0156, 1.0547, -0.9102, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8750, -0.2305, 2.7656, -0.7227, -3.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:3') tensor([[-5.3125, -1.3906, 2.7031, -1.1719, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.2188, -4.7500, -0.0991, 3.5312, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:59:38,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.52 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.1250, -5.6562, -3.1406, 1.0234, -2.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.2188, -4.9375, -0.2090, 1.8281, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:59:39,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:59:39,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.11 | bwd_microstep: 453.69 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 452.61 | step_microstep: 1.80 [2025-11-06 18:59:39,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 295.65 | bwd: 454.54 | bwd_inner: 1.74 | bwd_allreduce: 452.65 | step: 1.89 86%|████████▋ | 3026/3507 [1:14:53<08:09, 1.02s/it] {'loss': 0.5805, 'learning_rate': 9.709351384757338e-07, 'epoch': 0.86} 86%|████████▋ | 3026/3507 [1:14:53<08:09, 1.02s/it]tensor([[-7.0312, -5.2812, -0.0957, 1.0938, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3125, -3.5156, -0.2432, 2.0000, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5781, 0.5000, 2.7344, -2.3906, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.7812, -4.4688, 1.6406, -0.0287, -6.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') 
tensor([[-4.7188, -5.0938, -2.2188, 2.1406, -1.7734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:59:40,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.05 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-7.9062, -6.0000, 0.6328, 2.3594, -5.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7578, -2.2656, 0.3066, 4.7188, 0.5859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.7188, -4.5625, 0.2129, 2.4219, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:59:40,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 18:59:40,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.65 | bwd_microstep: 243.65 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 242.64 | step_microstep: 2.00 [2025-11-06 18:59:40,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.70 | bwd: 244.56 | bwd_inner: 1.73 | bwd_allreduce: 242.67 | step: 2.08 86%|████████▋ | 3027/3507 [1:14:54<09:40, 1.21s/it] {'loss': 0.1263, 'learning_rate': 9.669684740320096e-07, 'epoch': 0.86} 86%|████████▋ | 3027/3507 [1:14:54<09:40, 1.21s/it]tensor([[-4.5312, -4.4375, -0.7266, 3.0781, -1.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:59:41,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.90 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.4375, -3.5625, 0.8672, 1.5859, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.8125, -7.9375, -2.8281, 0.0400, -5.4688]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5312, -0.2266, 2.6875, -0.6953, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -2.6562, 0.9141, 0.4648, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4062, -4.5938, -1.5547, 2.4844, -1.6641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9688, -1.7891, 3.3594, -0.6523, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5625, -4.4688, 1.0469, 1.9844, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:59:42,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.01 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 18:59:42,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.78 | bwd_microstep: 2.12 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.97 [2025-11-06 18:59:42,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 292.70 | bwd: 3.04 | bwd_inner: 2.02 | bwd_allreduce: 0.87 | step: 3.06 86%|████████▋ | 3028/3507 [1:14:55<09:56, 1.25s/it] {'loss': 0.6096, 'learning_rate': 9.630095172026345e-07, 'epoch': 0.86} 86%|████████▋ | 3028/3507 [1:14:56<09:56, 1.25s/it]tensor([[-2.9219, 0.3008, 4.5625, 2.1406, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3906, -3.7344, -1.3359, 2.4531, -0.8711]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.9688, -2.4062, 2.7969, 0.2559, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7344, -3.0938, -0.0342, 2.4844, -1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0938, 0.7109, 3.1250, -1.2656, 
-3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.5312, -3.2344, 2.7031, -1.1562, -6.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:59:43,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 320.04 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.9844, -0.2432, 1.2266, -0.9805, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.4375, -4.9688, 1.5000, 1.9062, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 18:59:44,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.21 | optimizer_step: 0.18 [2025-11-06 18:59:44,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.44 | bwd_microstep: 34.11 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 32.98 | step_microstep: 3.03 [2025-11-06 18:59:44,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 501.51 | bwd: 35.08 | bwd_inner: 1.90 | bwd_allreduce: 33.03 | step: 3.13 86%|████████▋ | 3029/3507 [1:14:58<12:12, 1.53s/it] {'loss': 0.4793, 'learning_rate': 9.59058271365667e-07, 'epoch': 0.86} 86%|████████▋ | 3029/3507 [1:14:58<12:12, 1.53s/it]tensor([[-5.3750, -4.9062, -0.7266, 2.4844, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.2871, 3.0938, 2.3281, -2.0469, -1.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-4.4062, -4.5625, -0.7930, 3.5312, -1.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([3], device='cuda:3') [2025-11-06 18:59:44,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.40 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 
tensor([[-5.1875, -3.9062, -0.1138, 1.4688, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.3594, -3.1250, -2.9375, 0.3594, -0.2559]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9688, -4.5312, -0.4434, 0.9844, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1250, -3.1250, 1.4141, 4.2812, -1.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.3438, -2.6250, 3.0469, 0.2422, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:59:44,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 18:59:44,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.23 | bwd_microstep: 1.83 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.70 | step_microstep: 1.45 [2025-11-06 18:59:44,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 384.66 | bwd: 2.64 | bwd_inner: 1.79 | bwd_allreduce: 0.73 | step: 1.52 86%|████████▋ | 3030/3507 [1:14:58<09:32, 1.20s/it] {'loss': 0.193, 'learning_rate': 9.551147398925853e-07, 'epoch': 0.86} 86%|████████▋ | 3030/3507 [1:14:58<09:32, 1.20s/it]tensor([[-4.1250, -3.9375, -0.6250, 2.6094, -1.6797]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[0.1133, 2.5625, 4.9062, 3.5312, 0.0850]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0000, -4.8125, 0.0483, 2.7188, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5000, -1.6953, 2.9688, -0.2295, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:59:45,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.22 | 
bwd_microstep: 1.10 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-6.8438, -3.2812, 2.6719, 0.2715, -5.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4375, -3.9844, 0.7656, 2.2031, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.3750, -3.7344, 0.0156, 2.9531, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4688, -2.8125, 2.7344, -0.0601, -5.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 18:59:46,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 18:59:46,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.23 | bwd_microstep: 937.89 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 936.57 | step_microstep: 1.84 [2025-11-06 18:59:46,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.47 | bwd: 938.98 | bwd_inner: 2.23 | bwd_allreduce: 936.61 | step: 1.92 86%|████████▋ | 3031/3507 [1:15:00<11:13, 1.42s/it] {'loss': 0.986, 'learning_rate': 9.511789261482929e-07, 'epoch': 0.86} 86%|████████▋ | 3031/3507 [1:15:00<11:13, 1.42s/it]tensor([[-2.6094, 1.4297, 3.3281, -1.8516, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.7930, 2.6094, 2.4219, -2.0469, -1.8984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:59:46,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.96 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.9688, -5.1562, -0.1436, 2.9062, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.0000, -5.0625, 0.9141, 2.2656, -4.4688]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8750, -2.5469, 1.5234, 0.6680, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.2969, 1.0000, 1.9922, -1.3438, -2.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.2188, -2.7656, 1.1484, 2.5156, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -3.4375, 0.6055, 1.6484, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 18:59:49,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.20 | optimizer_step: 0.29 [2025-11-06 18:59:49,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.02 | bwd_microstep: 2884.22 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2883.12 | step_microstep: 2.13 [2025-11-06 18:59:49,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 300.00 | bwd: 2885.32 | bwd_inner: 2.00 | bwd_allreduce: 2883.17 | step: 2.21 86%|████████▋ | 3032/3507 [1:15:03<15:29, 1.96s/it] {'loss': 0.7166, 'learning_rate': 9.472508334910946e-07, 'epoch': 0.86} 86%|████████▋ | 3032/3507 [1:15:03<15:29, 1.96s/it]tensor([[-4.4062, -2.4688, 1.4688, 1.3906, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:59:50,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.93 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.9844, -4.2188, -3.1406, 1.6719, -0.2773]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4688, -3.2188, 0.3477, 1.8359, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0938, -1.2422, 1.7031, -0.8516, -3.6875]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2500, -3.0312, 0.9062, 0.5312, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2500, -2.0625, 0.9844, 2.6250, -1.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1875, -3.9375, -2.0625, 2.1406, -0.5820]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5156, -0.3594, 1.5859, -1.2578, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 18:59:52,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 18:59:52,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.12 | bwd_microstep: 2279.92 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 2278.84 | step_microstep: 1.91 [2025-11-06 18:59:52,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 288.08 | bwd: 2280.93 | bwd_inner: 1.90 | bwd_allreduce: 2278.88 | step: 2.00 86%|████████▋ | 3033/3507 [1:15:06<16:58, 2.15s/it] {'loss': 0.5168, 'learning_rate': 9.433304652727149e-07, 'epoch': 0.86} 86%|████████▋ | 3033/3507 [1:15:06<16:58, 2.15s/it]tensor([[-3.1094, -0.8477, 1.0781, -0.4590, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.6953, -2.7344, -2.5000, 1.0469, 0.2832]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-6.1250, -3.7969, 1.5547, 1.9062, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:59:52,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.53 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.1562, 0.5391, 2.9844, -1.3594, -3.5938]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:0') tensor([[-3.0156, -3.3906, -0.2871, 4.1562, -0.3926]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.8203, 2.5156, 4.3438, 0.4766, -1.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.0000, -4.5312, -0.3496, 2.8906, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6250, -4.7500, -0.2090, 2.6250, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 18:59:53,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 18:59:53,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.94 | bwd_microstep: 251.73 | bwd_inner_microstep: 1.39 | bwd_allreduce_microstep: 250.22 | step_microstep: 1.83 [2025-11-06 18:59:53,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.51 | bwd: 252.53 | bwd_inner: 2.12 | bwd_allreduce: 250.25 | step: 1.90 87%|████████▋ | 3034/3507 [1:15:06<13:12, 1.68s/it] {'loss': 0.5399, 'learning_rate': 9.394178248382868e-07, 'epoch': 0.87} 87%|████████▋ | 3034/3507 [1:15:06<13:12, 1.68s/it]tensor([[-4.3438, -4.0000, -0.9609, 1.8672, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:59:53,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.18 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.7188, -4.2812, 0.7188, 0.3750, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.2500, -2.8906, 0.9219, -1.7656, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.9453, 1.1094, 1.5859, -1.5156, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], 
device='cuda:1') tensor([[-7.0625, -4.3438, 1.0547, 0.4375, -5.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.7812, -2.7656, 2.2344, 0.4785, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.5625, -8.4375, -3.7656, 0.9766, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8438, -5.0000, -0.0659, 3.1094, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:59:53,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:59:53,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.28 | bwd_microstep: 398.30 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 397.34 | step_microstep: 2.40 [2025-11-06 18:59:53,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.46 | bwd: 399.08 | bwd_inner: 1.54 | bwd_allreduce: 397.38 | step: 2.49 87%|████████▋ | 3035/3507 [1:15:07<11:03, 1.41s/it] {'loss': 0.7725, 'learning_rate': 9.355129155263498e-07, 'epoch': 0.87} 87%|████████▋ | 3035/3507 [1:15:07<11:03, 1.41s/it]tensor([[-5.2500, -2.9531, -0.5664, -1.2109, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 18:59:54,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.66 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.6250, -0.5820, 2.3750, -1.9375, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.9062, -5.4688, -0.4082, 1.5078, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5781, -4.0312, -1.2500, 3.0625, -0.9453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-3.4844, -3.8750, -1.0469, 3.4375, -0.7734]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0625, -1.8359, 2.8750, -1.1562, -5.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4688, -2.4219, 2.3750, 0.6016, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6250, 0.2695, 1.4141, -1.7266, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 18:59:55,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 18:59:55,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.68 | bwd_microstep: 1468.06 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1466.88 | step_microstep: 1.96 [2025-11-06 18:59:55,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.37 | bwd: 1468.91 | bwd_inner: 1.88 | bwd_allreduce: 1466.91 | step: 2.03 87%|████████▋ | 3036/3507 [1:15:09<12:09, 1.55s/it] {'loss': 0.1617, 'learning_rate': 9.316157406688475e-07, 'epoch': 0.87} 87%|████████▋ | 3036/3507 [1:15:09<12:09, 1.55s/it]tensor([[-5.6250, -3.3438, 1.1172, 1.0234, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.5312, -3.7812, -0.7617, 3.2344, -1.0078]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -3.4688, 1.4609, 2.7969, -3.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-8.2500, -6.6875, -0.4121, 1.8281, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:59:56,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.54 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.6875, 
-2.7500, 1.2422, 3.6094, -1.6172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.9062, -5.5000, -0.8008, 0.8828, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-3.5156, 0.0156, 2.7344, -0.7930, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([2], device='cuda:2') tensor([[-4.9688, -1.1953, 2.8281, -0.4727, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 18:59:56,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.25 | optimizer_step: 0.15 [2025-11-06 18:59:56,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.03 | bwd_microstep: 157.04 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 156.04 | step_microstep: 2.09 [2025-11-06 18:59:56,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 417.61 | bwd: 158.15 | bwd_inner: 1.92 | bwd_allreduce: 156.09 | step: 2.18 87%|████████▋ | 3037/3507 [1:15:10<09:56, 1.27s/it] {'loss': 0.9066, 'learning_rate': 9.277263035911177e-07, 'epoch': 0.87} 87%|████████▋ | 3037/3507 [1:15:10<09:56, 1.27s/it]tensor([[-5.9062, -5.6562, -1.4844, 2.3281, -2.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8438, 0.7148, 3.4688, -2.4062, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.5312, -4.1562, 0.9336, -0.9844, -6.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0312, -4.1250, -1.5625, 1.7344, -1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.4219, -4.0312, -1.6250, 2.6719, -0.7773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 18:59:56,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 148.20 | bwd_microstep: 
0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.6562, -3.8750, -0.2852, 2.1875, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[ 0.2070, 3.2188, 4.4688, 1.8359, -0.3574]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6875, -5.5000, -2.9375, 1.9297, -1.5547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:00:01,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.21 | optimizer_step: 0.32 [2025-11-06 19:00:01,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.69 | bwd_microstep: 4160.08 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 4159.00 | step_microstep: 2.48 [2025-11-06 19:00:01,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.91 | bwd: 4160.80 | bwd_inner: 1.60 | bwd_allreduce: 4159.05 | step: 2.56 87%|████████▋ | 3038/3507 [1:15:15<18:38, 2.38s/it] {'loss': 0.6502, 'learning_rate': 9.238446076119001e-07, 'epoch': 0.87} 87%|████████▋ | 3038/3507 [1:15:15<18:38, 2.38s/it]tensor([[-6.5625, -3.6875, 1.5391, 0.5156, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3438, -2.1094, 2.2188, -0.5117, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5938, -4.0312, 1.2031, 2.8594, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7344, 0.5742, 3.2031, -2.0781, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:00:01,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.65 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.08 tensor([[-6.2500, -3.0469, 2.3750, 0.7773, -4.9375]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4062, -3.9375, 0.3281, 1.6250, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.9219, -1.2812, 2.6094, 1.2266, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.9375, -6.0312, -0.6133, 2.7344, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:00:01,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 19:00:01,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.43 | bwd_microstep: 1.64 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.66 | step_microstep: 1.43
[2025-11-06 19:00:01,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 444.11 | bwd: 2.52 | bwd_inner: 1.67 | bwd_allreduce: 0.71 | step: 1.52
87%|████████▋ | 3039/3507 [1:15:15<14:09, 1.81s/it] {'loss': 0.2981, 'learning_rate': 9.19970656043333e-07, 'epoch': 0.87}
tensor([[-3.7656, -3.2344, -0.4473, 1.8438, -1.7266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9375, -5.1875, -1.5156, 3.0000, -1.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9844, -3.4844, 0.5430, 3.5469, -1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:00:02,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.23 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.8125, -3.9531, -0.1494, 2.3438, -2.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.0938, -0.7070, 2.1562, 1.1094, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.0000, -2.0312, 2.0625, 4.2812, -1.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8438, -3.8750, 0.3398, 2.7031, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.5938, -3.6094, 0.8477, 3.3125, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:00:03,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 19:00:03,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.19 | bwd_microstep: 1432.93 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1431.75 | step_microstep: 1.82
[2025-11-06 19:00:03,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.45 | bwd: 1433.80 | bwd_inner: 1.89 | bwd_allreduce: 1431.79 | step: 1.90
87%|████████▋ | 3040/3507 [1:15:17<14:07, 1.81s/it] {'loss': 0.4119, 'learning_rate': 9.161044521909412e-07, 'epoch': 0.87}
tensor([[-1.2109, 2.2500, 2.6094, -1.9531, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:00:03,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 122.15 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.0938, 0.5117, 3.7969, 0.1670, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.2812, -6.2812, -0.6172, 2.6250, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3750, -1.8125, 2.4688, -0.3164, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8750, -4.8125, -1.3125, 2.2188, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.4844, 1.2266, 2.5625, -1.8672, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.1875, -5.4688, -1.9375, 2.6094, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.4531, 2.1250, 2.7500, -1.5781, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
[2025-11-06 19:00:04,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 19:00:04,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.25 | bwd_microstep: 87.71 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 86.86 | step_microstep: 1.84
[2025-11-06 19:00:04,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.43 | bwd: 88.48 | bwd_inner: 1.46 | bwd_allreduce: 86.90 | step: 1.91
87%|████████▋ | 3041/3507 [1:15:17<10:52, 1.40s/it] {'loss': 0.2653, 'learning_rate': 9.122459993536392e-07, 'epoch': 0.87}
tensor([[-4.3750, -1.2109, 2.4531, 0.3223, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.6562, -5.5000, -1.4062, 2.5625, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2812, -1.8828, 1.5469, -1.2969, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:00:04,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.58 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.0938, -3.1719, 1.5469, 2.1719, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.1875, -2.7812, 2.4219, -0.0835, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.3750, -1.7188, 2.4844, -2.9531, -6.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-2.6719, -3.6719, -2.5938, 1.7969, -0.1196]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-4.4688, -4.3438, -0.6797, 2.9219, -1.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:00:06,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 19:00:06,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.79 | bwd_microstep: 1543.93 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1542.88 | step_microstep: 1.81
[2025-11-06 19:00:06,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 396.39 | bwd: 1544.77 | bwd_inner: 1.68 | bwd_allreduce: 1542.93 | step: 1.89
87%|████████▋ | 3042/3507 [1:15:19<12:12, 1.58s/it] {'loss': 0.3631, 'learning_rate': 9.083953008237311e-07, 'epoch': 0.87}
tensor([[-3.8594, -0.7500, 3.4531, 1.2188, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8438, -3.7969, 0.2754, 2.5000, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9375, -1.9609, 1.9688, -0.1582, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.9062, -4.2188, 1.6875, 1.2266, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0625, -1.9922, 3.1094, 1.4062, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:00:06,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 299.56 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-2.8125, -2.1094, 1.6953, 4.5625, -0.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.4375, -4.4375, 1.0234, 2.1562, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.7500, -3.9531, 0.1621, 2.8594, -2.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 19:00:06,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.22 | optimizer_step: 5.91
[2025-11-06 19:00:06,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.07 | bwd_microstep: 14.14 | bwd_inner_microstep: 1.19 | bwd_allreduce_microstep: 12.84 | step_microstep: 7.70
[2025-11-06 19:00:06,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 426.64 | bwd: 14.97 | bwd_inner: 1.92 | bwd_allreduce: 12.88 | step: 7.79
87%|████████▋ | 3043/3507 [1:15:20<09:40, 1.25s/it] {'loss': 0.3217, 'learning_rate': 9.045523598869011e-07, 'epoch': 0.87}
tensor([[-4.2812, -1.1875, 2.0938, -0.4023, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3750, -4.7500, -0.4277, 2.8906, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-0.5625, 2.7500, 2.1250, -2.0312, -1.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
[2025-11-06 19:00:06,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.40 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.1562, 0.8789, 4.3750, -2.4531, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.6875, -4.8438, 0.3047, 1.1328, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.1562, -3.9844, -2.6250, 1.3828, -0.6992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
tensor([[-2.4844, -3.5781, -2.4219, 1.9375, -0.0364]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-4.8750, -5.5000, -3.1562, 1.3906, -1.8203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:00:08,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.20 | optimizer_step: 0.28
[2025-11-06 19:00:08,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.16 | bwd_microstep: 1871.21 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 1870.30 | step_microstep: 2.13
[2025-11-06 19:00:08,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.60 | bwd: 1872.09 | bwd_inner: 1.58 | bwd_allreduce: 1870.35 | step: 2.21
87%|████████▋ | 3044/3507 [1:15:22<12:05, 1.57s/it] {'loss': 0.6791, 'learning_rate': 9.007171798222136e-07, 'epoch': 0.87}
tensor([[-4.6562, -2.0000, 1.6172, 0.0693, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.9375, -6.0938, -0.5938, 2.9688, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:00:09,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.94 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.6875, -3.3438, 0.4336, 1.9375, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.0781, 2.4688, 3.4844, -3.2500, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.0000, -0.6484, 2.4531, -0.5508, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-4.0938, -0.2109, 2.3125, -1.8203, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-8.6250, -4.7500, 2.0156, -0.6016, -7.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-8.7500, -6.3438, -0.8750, -0.6797, -6.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:00:09,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 19:00:09,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.76 | bwd_microstep: 41.59 | bwd_inner_microstep: 1.56 | bwd_allreduce_microstep: 39.95 | step_microstep: 2.09
[2025-11-06 19:00:09,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.72 | bwd: 42.60 | bwd_inner: 2.49 | bwd_allreduce: 39.99 | step: 2.16
87%|████████▋ | 3045/3507 [1:15:23<09:28, 1.23s/it] {'loss': 0.9069, 'learning_rate': 8.968897639021157e-07, 'epoch': 0.87}
tensor([[-5.6875, -3.7969, 0.5742, 1.2344, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-0.6289, 2.4531, 3.0938, -0.0908, -1.2422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.9062, -5.6562, -1.2109, 2.6719, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4375, -0.3301, 2.7344, -1.9844, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.0156, 1.8047, 2.1875, -2.5625, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-2.4531, 1.0078, 2.7812, -0.8750, -2.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 19:00:09,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 113.81 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.1875, -1.9688, 1.0781, 4.3750, -0.1572]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.2812, -1.8359, 2.1562, 1.3984, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 19:00:12,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.18 | optimizer_step: 0.23
[2025-11-06 19:00:12,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.44 | bwd_microstep: 2144.82 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 2143.63 | step_microstep: 2.33
[2025-11-06 19:00:12,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.27 | bwd: 2145.78 | bwd_inner: 1.98 | bwd_allreduce: 2143.68 | step: 2.42
87%|████████▋ | 3046/3507 [1:15:26<13:12, 1.72s/it] {'loss': 0.5389, 'learning_rate': 8.930701153924215e-07, 'epoch': 0.87}
tensor([[-6.7500, -3.5000, 1.8516, -0.0178, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:00:12,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 93.13 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.1562, -2.6094, 1.1875, -0.2061, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0000, -4.4062, -0.0537, 1.2500, -3.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.5312, -5.6875, -2.1250, 2.0312, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.7812, -5.3125, -0.0439, 2.0781, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3438, -4.3750, 0.0703, 2.4531, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4375, -1.3438, 2.0781, -2.2344, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.8438, -2.9688, 2.7188, 1.7344, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:00:12,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.79 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 19:00:12,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.50 | bwd_microstep: 1.66 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.66 | step_microstep: 2.11
[2025-11-06 19:00:12,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 406.65 | bwd: 2.51 | bwd_inner: 1.70 | bwd_allreduce: 0.69 | step: 2.18
87%|████████▋ | 3047/3507 [1:15:26<10:15, 1.34s/it] {'loss': 0.1555, 'learning_rate': 8.892582375523296e-07, 'epoch': 0.87}
tensor([[-5.4688, -3.6562, 0.7617, 1.6016, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6562, -4.1875, -0.6445, 2.4062, -2.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.1250, -4.9688, -0.2080, 2.3281, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0000, -1.1094, 2.4219, -1.3516, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:00:12,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 296.22 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.8125, -3.7188, 0.2832, 2.2031, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3438, -6.0312, -3.5781, 1.2344, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2812, -1.2500, 2.8125, -1.6641, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.6562, -4.9688, 0.6641, 2.4375, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:00:15,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.58 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 19:00:15,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.26 | bwd_microstep: 2256.84 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 2255.88 | step_microstep: 3.54
[2025-11-06 19:00:15,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 507.52 | bwd: 2257.57 | bwd_inner: 1.50 | bwd_allreduce: 2255.93 | step: 3.63
87%|████████▋ | 3048/3507 [1:15:29<13:37, 1.78s/it] {'loss': 0.1171, 'learning_rate': 8.854541336343947e-07, 'epoch': 0.87}
tensor([[-3.3281, 0.1396, 2.9688, -0.9180, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.8125, -5.2188, -0.8125, 2.5938, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.4375, -3.1875, 1.0078, 2.6250, -2.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.3438, -2.5938, 3.0312, 0.0820, -5.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:00:15,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.36 | bwd_microstep: 1.12 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-4.4375, -4.0938, -0.6133, 2.5000, -2.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.3906, -0.2617, 1.7422, 0.7109, -1.8516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0938, -3.2031, 0.6406, 1.2578, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.4375, -2.1250, 2.8438, 0.6328, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:00:15,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.29 | optimizer_step: 0.22
[2025-11-06 19:00:15,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.72 | bwd_microstep: 16.17 | bwd_inner_microstep: 1.29 | bwd_allreduce_microstep: 14.76 | step_microstep: 2.44
[2025-11-06 19:00:15,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 421.12 | bwd: 17.29 | bwd_inner: 2.28 | bwd_allreduce: 14.81 | step: 2.57
87%|████████▋ | 3049/3507 [1:15:29<10:39, 1.40s/it] {'loss': 0.8605, 'learning_rate': 8.816578068845472e-07, 'epoch': 0.87}
tensor([[-6.3750, -2.7188, 2.1094, -0.9062, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.1562, -4.0625, -1.7969, 3.1406, -0.3223]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.4062, -3.8906, 1.8828, 1.5391, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.9062, -3.9688, -0.1514, -0.0747, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4531, 1.1016, 3.8438, -2.3438, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.1562, -2.5938, 1.7266, -1.2734, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:00:17,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.30 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.6875, -3.6562, 0.5391, 0.4668, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.0469, -1.2109, 1.6953, 1.4766, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:00:19,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.64 | optimizer_step: 0.67
[2025-11-06 19:00:19,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 258.23 | bwd_microstep: 1500.04 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 1498.82 | step_microstep: 309.50
[2025-11-06 19:00:19,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 390.54 | bwd: 1501.02 | bwd_inner: 1.98 | bwd_allreduce: 1498.88 | step: 309.61
87%|████████▋ | 3050/3507 [1:15:33<16:09, 2.12s/it] {'loss': 0.4138, 'learning_rate': 8.778692605420747e-07, 'epoch': 0.87}
tensor([[-3.8125, -0.7969, 2.4688, 0.2344, -3.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:00:19,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.77 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-5.2188, -2.7812, 1.2422, 0.2197, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5781, -1.5391, 2.1719, 2.2812, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.0938, -4.3125, 0.8281, 2.1094, -3.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.9375, -4.2812, 0.7070, 2.1406, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.9688, -4.7812, -0.5273, 1.4062, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.6562, -4.7500, 1.1406, 2.4062, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8125, -1.0312, 2.1250, -1.7031, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 19:00:20,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.18
[2025-11-06 19:00:20,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.59 | bwd_microstep: 203.95 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 202.89 | step_microstep: 1.57
[2025-11-06 19:00:20,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.38 | bwd: 204.92 | bwd_inner: 1.83 | bwd_allreduce: 202.94 | step: 1.68
87%|████████▋ | 3051/3507 [1:15:34<12:35, 1.66s/it] {'loss': 0.2557, 'learning_rate': 8.740884978396358e-07, 'epoch': 0.87}
tensor([[-3.2031, 1.0156, 3.5312, -1.6406, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.4062, -3.6719, -0.4414, 1.7188, -2.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:00:20,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 99.60 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.6562, -1.5938, 2.2969, 0.0168, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.3750, -2.8906, 1.8672, -0.9141, -5.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.0000, 0.7773, 2.4688, -2.1094, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.8750, -3.8125, -0.9844, 2.4219, -1.4766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.5625, -6.5312, -0.7344, 2.4688, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3125, -3.3906, 0.8203, 1.2578, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:00:21,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 19:00:21,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 148.94 | bwd_microstep: 971.21 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 970.19 | step_microstep: 2.35
[2025-11-06 19:00:21,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 248.54 | bwd: 972.11 | bwd_inner: 1.71 | bwd_allreduce: 970.24 | step: 2.45
87%|████████▋ | 3052/3507 [1:15:35<11:39, 1.54s/it] {'loss': 0.1487, 'learning_rate': 8.703155220032378e-07, 'epoch': 0.87}
tensor([[-5.8125, -5.6562, -1.9375, 1.7812, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5312, -1.5703, 2.0000, -0.0859, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:00:21,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.69 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.20
tensor([[-2.6094, -3.3906, -2.4531, 1.4141, -0.3398]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2')
tensor([[-2.7188, 1.6797, 3.2656, -2.5781, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.2500, -3.5000, 0.4902, 3.0781, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[ 1.4141, 5.0000, 5.5625, 0.5859, -0.2275]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.1250, -4.7188, -0.7070, 2.8125, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3125, 0.3809, 4.0938, -1.9062, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 19:00:22,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 19:00:22,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.58 | bwd_microstep: 162.10 | bwd_inner_microstep: 1.41 | bwd_allreduce_microstep: 160.60 | step_microstep: 2.00
[2025-11-06 19:00:22,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.29 | bwd: 163.18 | bwd_inner: 2.39 | bwd_allreduce: 160.64 | step: 2.20
87%|████████▋ | 3053/3507 [1:15:35<09:20, 1.24s/it] {'loss': 0.3668, 'learning_rate': 8.665503362522509e-07, 'epoch': 0.87}
tensor([[-6.0312, -2.6250, 1.1094, -1.7500, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1875, -3.0312, 1.2500, 1.4375, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.5625, -4.1562, -0.1152, 3.2031, -1.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:00:22,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.56 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-5.3438, -1.9609, 3.0469, 0.7578, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.6875, -5.2500, -0.7031, -1.3125, -5.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5625, -2.4219, 1.3125, 1.2266, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.5000, -4.9062, -0.0913, 1.2656, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.6250, -3.9688, 1.4219, -1.0859, -6.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 19:00:23,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 19:00:23,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.32 | bwd_microstep: 766.66 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 765.44 | step_microstep: 1.66
[2025-11-06 19:00:23,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.91 | bwd: 767.76 | bwd_inner: 2.09 | bwd_allreduce: 765.50 | step: 1.77
87%|████████▋ | 3054/3507 [1:15:37<09:08, 1.21s/it] {'loss': 0.4541, 'learning_rate': 8.627929437993898e-07, 'epoch': 0.87}
tensor([[-2.0312, 2.0156, 2.9844, -1.8125, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.0469, -3.9219, -2.0312, 2.5781, -0.3887]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 19:00:23,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.33 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-5.0000, -1.1484, 2.0156, -1.9453, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.9375, -5.6250, -0.2100, 0.3906, -5.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.7031, -4.3750, -2.5156, 1.6406, -1.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.2812, -5.4062, -1.5391, 2.5938, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0625, -2.8438, 1.3672, 1.4766, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.7812, -7.0000, -2.9531, -0.1748, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 19:00:24,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.20 | optimizer_step: 0.18
[2025-11-06 19:00:24,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.84 | bwd_microstep: 855.03 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 853.76 | step_microstep: 2.23
[2025-11-06 19:00:24,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.19 | bwd: 855.87 | bwd_inner: 1.96 | bwd_allreduce: 853.79 | step: 2.30
87%|████████▋ | 3055/3507 [1:15:38<09:12, 1.22s/it] {'loss': 0.6625, 'learning_rate': 8.590433478507287e-07, 'epoch': 0.87}
tensor([[-5.5938, -4.2188, -0.0062, 1.4844, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3438, -1.3828, 1.8125, -0.2041, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4375, -1.1328, 2.8750, -1.7891, -5.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0000, -3.3906, 1.5000, 0.7500, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.2344, 0.8398, 3.8594, -0.7617, -3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5938, -3.9844, -0.0962, 2.6719, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.3125, -4.5625, 0.9805, 2.4062, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:00:26,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 284.58 | bwd_microstep: 1.17 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15
tensor([[-5.9688, -4.0625, 0.7656, 1.4688, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:00:27,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 19:00:27,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 429.19 | bwd_microstep: 1.86 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.78 | step_microstep: 2.13
[2025-11-06 19:00:27,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 713.82 | bwd: 3.04 | bwd_inner: 2.00 | bwd_allreduce: 0.86 | step: 2.27
87%|████████▋ | 3056/3507 [1:15:41<12:56, 1.72s/it] {'loss': 0.8121, 'learning_rate': 8.553015516056839e-07, 'epoch': 0.87}
tensor([[-4.5938, -2.7031, 1.1719, 1.0859, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.8125, -6.7812, -1.8984, 1.0234, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:00:27,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.55 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.6250, -2.9375, 1.3438, 0.1338, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0312, -1.5000, 3.2656, 0.3887, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.5000, -4.8125, 1.1016, 1.0234, -5.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.7500, 0.6484, 2.1250, -1.2656, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.6562, -2.2188, 2.6406, 0.3555, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2500, -0.8477, 0.3613, -5.1250, -5.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
[2025-11-06 19:00:27,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 19:00:27,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 78.66 | bwd_microstep: 204.24 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 203.23 | step_microstep: 1.46
[2025-11-06 19:00:27,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 248.22 | bwd: 205.10 | bwd_inner: 1.71 | bwd_allreduce: 203.26 | step: 1.54
87%|████████▋ | 3057/3507 [1:15:41<10:07, 1.35s/it] {'loss': 0.4635, 'learning_rate': 8.515675582570181e-07, 'epoch': 0.87}
tensor([[-5.7188, -3.2969, 1.5781, 1.1719, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1562, -3.9375, -0.1221, 3.6094, -1.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7812, -1.2969, 2.3125, -0.5312, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:00:28,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.91 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.4531, -4.4688, -2.9688, 1.7344, -0.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.7188, -4.2812, 1.5547, 1.6797, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.7422, 2.7656, 1.8984, -2.8750, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.9062, -1.8438, 3.7344, 0.0859, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.3125, -1.0469, 2.3750, -2.7188, -5.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 19:00:31,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.21 | optimizer_step: 0.31
[2025-11-06 19:00:31,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.22 | bwd_microstep: 3083.37 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 3082.42 | step_microstep: 2.42
[2025-11-06 19:00:31,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.15 | bwd: 3084.32 | bwd_inner: 1.72 | bwd_allreduce: 3082.47 | step: 2.49
87%|████████▋ | 3058/3507 [1:15:45<14:49, 1.98s/it] {'loss': 0.7067, 'learning_rate': 8.478413709908351e-07, 'epoch': 0.87}
tensor([[-0.6211, 3.4219, 5.4688, 0.3984, -1.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:00:31,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 79.00 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.1250, -4.2812, 1.6250, 3.0625, -3.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2812, -3.3906, 0.8008, 1.0547, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.7500, -4.3125, 0.1123, 1.8516, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6094, 1.6406, 3.2031, -2.3125, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.2812, -3.5625, 0.7773, 3.6562, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4688, -2.5469, 1.5078, -0.2295, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1562, -3.7031, 0.6172, 2.4375, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 19:00:32,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 19:00:32,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.67 | bwd_microstep: 273.77 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 272.67 | step_microstep: 1.58
[2025-11-06 19:00:32,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 392.70 | bwd: 274.82 | bwd_inner: 1.96 | bwd_allreduce: 272.72 | step: 1.68
87%|████████▋ | 3059/3507 [1:15:45<11:55, 1.60s/it] {'loss': 0.4707, 'learning_rate': 8.44122992986578e-07, 'epoch': 0.87}
tensor([[-4.8438, -3.3594, 0.6680, 2.0312, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.5938, -4.0000, -0.0498, 2.9844, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:00:32,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.20 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[1.5156, 4.8750, 5.1250, 0.7383, 0.0179]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8750, -4.0000, -0.1338, 2.1562, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.7500, -2.5625, 1.6016, -0.6836, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.5312, -2.3125, 1.3906, 2.7656, -1.8672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.9062, -5.8750, -1.7891, 2.2344, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.5469, -2.4375, 1.3672, 3.2656, -1.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3],
device='cuda:0') [2025-11-06 19:00:32,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 19:00:32,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.30 | bwd_microstep: 1.87 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.82 | step_microstep: 1.76 [2025-11-06 19:00:32,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.53 | bwd: 2.77 | bwd_inner: 1.77 | bwd_allreduce: 0.86 | step: 1.85 87%|████████▋ | 3060/3507 [1:15:46<09:13, 1.24s/it] {'loss': 0.363, 'learning_rate': 8.404124274170278e-07, 'epoch': 0.87} 87%|████████▋ | 3060/3507 [1:15:46<09:13, 1.24s/it]tensor([[-2.3281, -3.2969, -2.2344, 1.8828, 0.0408]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.6992, -1.4453, -0.2480, 3.6406, 1.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6250, -0.0374, 2.4844, -1.3359, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.3750, -4.5312, -1.0859, 3.2188, -1.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:00:32,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.07 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.5156, 0.1738, 3.0781, -0.6914, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.8125, -4.5000, 0.6406, 0.7461, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.6875, -3.5312, 1.2891, 1.1094, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1875, -2.3281, 2.5312, 1.2734, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 
19:00:32,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.23 | optimizer_step: 0.24 [2025-11-06 19:00:32,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.96 | bwd_microstep: 64.95 | bwd_inner_microstep: 6.95 | bwd_allreduce_microstep: 57.89 | step_microstep: 2.47 [2025-11-06 19:00:32,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.06 | bwd: 65.82 | bwd_inner: 7.71 | bwd_allreduce: 57.95 | step: 2.56 87%|████████▋ | 3061/3507 [1:15:46<07:28, 1.01s/it] {'loss': 0.7477, 'learning_rate': 8.367096774482996e-07, 'epoch': 0.87} 87%|████████▋ | 3061/3507 [1:15:46<07:28, 1.01s/it]tensor([[-4.7812, -3.9375, -0.6680, 0.9766, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:00:33,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.36 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-7.0625, -4.7188, 1.3672, 1.7734, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.6562, 0.3555, 2.6094, 0.3457, -2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8438, -3.4844, -1.1406, 3.2031, -0.2852]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-1.0703, 2.7031, 3.0625, -1.3828, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, -0.0562, 3.4688, -1.2344, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -0.8867, 3.4531, -1.0078, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.7500, -3.2188, 1.6328, 1.0547, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:00:35,618] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 19:00:35,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 243.35 | bwd_microstep: 2156.72 | bwd_inner_microstep: 5.23 | bwd_allreduce_microstep: 2151.40 | step_microstep: 1.89 [2025-11-06 19:00:35,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 373.72 | bwd: 2157.65 | bwd_inner: 6.00 | bwd_allreduce: 2151.47 | step: 2.00 87%|████████▋ | 3062/3507 [1:15:49<11:14, 1.52s/it] {'loss': 0.74, 'learning_rate': 8.330147462398353e-07, 'epoch': 0.87} 87%|████████▋ | 3062/3507 [1:15:49<11:14, 1.52s/it]tensor([[-5.1562, -2.9844, 1.7578, 1.8359, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, -4.0938, -1.2422, 2.2812, -1.6328]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9688, -0.2168, 3.6250, -2.6094, -5.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:00:35,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.65 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.7188, -4.1250, 0.8828, 0.1592, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0312, -5.2812, -0.6328, 2.4688, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3438, -1.1641, 3.7500, -0.5703, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -2.0625, 2.4531, 2.8594, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4688, -3.6875, -1.7188, 1.5781, -1.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:00:36,171] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 19:00:36,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.11 | bwd_microstep: 1.85 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.34 [2025-11-06 19:00:36,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.78 | bwd: 2.69 | bwd_inner: 1.73 | bwd_allreduce: 0.82 | step: 2.42 87%|████████▋ | 3063/3507 [1:15:50<09:05, 1.23s/it] {'loss': 0.2897, 'learning_rate': 8.293276369444114e-07, 'epoch': 0.87} 87%|████████▋ | 3063/3507 [1:15:50<09:05, 1.23s/it]tensor([[-5.9688, -2.9062, 1.5234, -0.2236, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0000, -4.5625, -1.0000, 2.0469, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:00:36,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.49 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.1562, -3.0625, 3.0781, -0.4219, -6.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.7500, -3.1250, 1.8672, -1.1172, -5.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1719, 1.1797, 3.6406, -1.9922, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8125, -1.3047, 2.5625, -0.6562, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.5625, 1.1172, 1.5781, -3.0000, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.4688, -2.8750, 2.9219, 0.3184, -5.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:00:43,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.19 | optimizer_gradients: 0.21 | optimizer_step: 0.31 [2025-11-06 19:00:43,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.84 | bwd_microstep: 5465.92 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 5464.98 | step_microstep: 2.45 [2025-11-06 19:00:43,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.34 | bwd: 5466.60 | bwd_inner: 1.41 | bwd_allreduce: 5465.04 | step: 2.53 87%|████████▋ | 3064/3507 [1:15:57<22:15, 3.02s/it] {'loss': 0.3499, 'learning_rate': 8.256483527081305e-07, 'epoch': 0.87} 87%|████████▋ | 3064/3507 [1:15:57<22:15, 3.02s/it]tensor([[-2.4375, 1.4609, 2.3594, -2.6719, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.6875, -4.6250, -1.2812, 2.5156, -1.9609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:00:43,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.46 | bwd_microstep: 1.07 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.2188, -1.3047, 2.2656, 0.6680, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0312, -3.0312, 1.0156, 1.0938, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5312, -4.7188, -0.1338, 2.6250, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8125, -4.7188, -0.2910, 2.2812, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4062e+00, -2.2031e+00, 9.5703e-01, 3.1128e-03, -3.4219e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0312, -4.9062, 0.1533, 2.8125, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:00:43,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.34 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 19:00:43,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.02 | bwd_microstep: 177.93 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 176.93 | step_microstep: 1.69 [2025-11-06 19:00:43,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.51 | bwd: 178.99 | bwd_inner: 1.89 | bwd_allreduce: 176.98 | step: 1.76 87%|████████▋ | 3065/3507 [1:15:57<16:51, 2.29s/it] {'loss': 0.3529, 'learning_rate': 8.21976896670409e-07, 'epoch': 0.87} 87%|████████▋ | 3065/3507 [1:15:57<16:51, 2.29s/it]tensor([[-1.6953, 0.6250, 1.4766, -0.1816, -1.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7031, -3.3438, -0.5273, 2.3594, -1.5078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:00:44,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.39 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.4375, -4.6562, 0.1118, 1.1875, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7188, -3.4375, -1.2891, 2.9688, -0.2617]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3125, -4.5312, -1.3984, 2.8906, -1.4766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1562, -3.2812, 1.6094, 2.4531, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.6562, -4.5000, 1.5234, 0.2910, -5.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5000, -1.2734, 3.2188, -1.2109, -5.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:00:44,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | 
optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 19:00:44,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.24 | bwd_microstep: 579.99 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 578.81 | step_microstep: 1.49 [2025-11-06 19:00:44,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 282.65 | bwd: 580.81 | bwd_inner: 1.83 | bwd_allreduce: 578.85 | step: 1.56 87%|████████▋ | 3066/3507 [1:15:58<13:44, 1.87s/it] {'loss': 0.4447, 'learning_rate': 8.183132719639908e-07, 'epoch': 0.87} 87%|████████▋ | 3066/3507 [1:15:58<13:44, 1.87s/it]tensor([[-2.1562, 0.6562, 1.9141, -0.4629, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.3438, -4.7500, -0.6641, 2.6875, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -2.2969, 1.5078, 0.7812, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:00:45,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.56 | bwd_microstep: 0.65 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[2.2656, 3.9062, 5.6562, 5.5312, 2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6094, 0.7305, 3.9219, -1.0469, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.2812, -4.0000, 1.6328, 1.9219, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.9375, -4.9375, 1.1562, 2.0938, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6562, -2.7812, 1.1875, 2.0938, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:00:45,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | 
optimizer_step: 0.18 [2025-11-06 19:00:45,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.50 | bwd_microstep: 173.00 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 172.03 | step_microstep: 2.17 [2025-11-06 19:00:45,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.08 | bwd: 173.66 | bwd_inner: 1.41 | bwd_allreduce: 172.08 | step: 2.25 87%|████████▋ | 3067/3507 [1:15:59<10:52, 1.48s/it] {'loss': 0.5645, 'learning_rate': 8.146574817149411e-07, 'epoch': 0.87} 87%|████████▋ | 3067/3507 [1:15:59<10:52, 1.48s/it]tensor([[-6.4688, -3.7344, 1.2891, 0.7344, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -3.2188, 1.2266, 2.1406, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2344, -2.2969, 0.5312, 2.0312, -1.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5000, -5.1250, -0.4375, 1.4219, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:00:45,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.55 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.6562, -1.2188, 2.8906, -0.1680, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7344, -3.3125, -1.9219, 1.8281, -0.4180]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1562, -4.2812, -0.6602, 3.7031, -1.3516]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.3438, 1.6875, 2.6562, -2.7812, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:00:47,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.19 | optimizer_step: 0.18 
[2025-11-06 19:00:47,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.15 | bwd_microstep: 2149.93 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 2148.85 | step_microstep: 2.24 [2025-11-06 19:00:47,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.72 | bwd: 2150.76 | bwd_inner: 1.67 | bwd_allreduce: 2148.90 | step: 2.34 87%|████████▋ | 3068/3507 [1:16:01<13:11, 1.80s/it] {'loss': 0.4013, 'learning_rate': 8.110095290426334e-07, 'epoch': 0.87} 87%|████████▋ | 3068/3507 [1:16:01<13:11, 1.80s/it]tensor([[-6.1250, -2.7031, 2.8594, 0.5234, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:00:48,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.40 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.7734, 1.7188, 2.3594, -2.0625, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5938, -5.4062, -1.0547, 1.0234, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8750, -3.7188, 0.6758, 0.8086, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.2812, -3.7812, 2.3906, 0.2061, -5.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.3984, 1.4141, 3.6719, 1.7891, -1.3359]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.8750, -3.5312, 2.3594, 0.6250, -5.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5625, -5.0000, -2.2500, 2.3281, -1.6016]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:00:48,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 19:00:48,438] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.69 | bwd_microstep: 162.16 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 161.05 | step_microstep: 2.42 [2025-11-06 19:00:48,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 265.10 | bwd: 163.02 | bwd_inner: 1.78 | bwd_allreduce: 161.10 | step: 2.49 88%|████████▊ | 3069/3507 [1:16:02<10:13, 1.40s/it] {'loss': 0.248, 'learning_rate': 8.073694170597579e-07, 'epoch': 0.88} 88%|████████▊ | 3069/3507 [1:16:02<10:13, 1.40s/it]tensor([[-5.7500, -1.3594, 2.7344, -2.2656, -5.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7656, -0.8359, 3.3594, 3.6719, -1.5703]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.4688, -1.1172, 2.5469, 1.6719, -2.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:00:48,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.33 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.08 tensor([[-5.6562, -3.8594, 0.4941, 1.2969, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.4062, -7.3750, -3.0938, 1.3906, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.2812, -5.1562, -0.3789, 0.0422, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1562, -4.0938, -0.9531, 2.7656, -1.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.4062, -5.4375, -0.2754, 2.9688, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:00:50,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 19:00:50,486] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.51 | bwd_microstep: 1723.23 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 1722.26 | step_microstep: 2.22 [2025-11-06 19:00:50,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.86 | bwd: 1724.08 | bwd_inner: 1.64 | bwd_allreduce: 1722.32 | step: 2.30 88%|████████▊ | 3070/3507 [1:16:04<11:36, 1.59s/it] {'loss': 0.3818, 'learning_rate': 8.037371488723078e-07, 'epoch': 0.88} 88%|████████▊ | 3070/3507 [1:16:04<11:36, 1.59s/it]tensor([[-5.1250, -1.8438, 1.6328, -0.9727, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0000, 0.5977, 4.1875, -1.0625, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:00:50,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.77 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-2.9062, 1.8984, 3.6094, -3.1250, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9688, -1.8984, 2.5312, 0.5547, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9062, -4.1562, -1.3594, 2.5938, -1.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-2.6875, 0.9922, 2.6875, -1.8984, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.2500, -4.8750, 1.0234, 1.1562, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.1719, 0.3750, 2.4688, -1.0156, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:00:51,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.23 | optimizer_step: 0.31 [2025-11-06 19:00:51,416] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 174.02 | bwd_microstep: 512.99 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 511.98 | step_microstep: 2.47 [2025-11-06 19:00:51,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.81 | bwd: 514.03 | bwd_inner: 1.80 | bwd_allreduce: 512.05 | step: 2.57 88%|████████▊ | 3071/3507 [1:16:05<10:08, 1.40s/it] {'loss': 0.685, 'learning_rate': 8.001127275795928e-07, 'epoch': 0.88} 88%|████████▊ | 3071/3507 [1:16:05<10:08, 1.40s/it]tensor([[-1.7109, -2.4062, -1.5156, 1.9688, 0.3008]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.1562, -3.9688, 1.9531, 0.5039, -5.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0938, -3.7344, 2.2188, 0.2832, -5.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -3.9219, 0.2773, 2.5000, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:00:51,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.20 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.3906, -0.5078, 1.7734, -0.3691, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7188, -2.5469, 1.7500, 1.2422, -3.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2188, -3.6719, 0.9102, 2.5469, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0938, -5.1562, -1.9766, 1.9219, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:00:51,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 19:00:51,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 178.25 | bwd_microstep: 167.65 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 166.64 | step_microstep: 1.48 [2025-11-06 19:00:51,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.46 | bwd: 168.64 | bwd_inner: 1.80 | bwd_allreduce: 166.69 | step: 1.59 88%|████████▊ | 3072/3507 [1:16:05<08:14, 1.14s/it] {'loss': 0.4431, 'learning_rate': 7.964961562742212e-07, 'epoch': 0.88} 88%|████████▊ | 3072/3507 [1:16:05<08:14, 1.14s/it]tensor([[-4.6250, -2.5625, 1.8281, 2.3281, -2.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:00:52,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.69 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.2500, -3.3125, 1.3750, 2.0781, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7188, -0.3262, 3.3125, -2.0469, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1250, -3.0469, -0.6289, 2.4844, -0.9883]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-5.2188, -3.5938, 0.4453, 1.3359, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9219, -4.5625, -2.4531, 1.6641, -1.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1250, -0.3398, 4.1562, 0.8242, -3.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.0000, -1.3516, 3.2500, 0.0771, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:00:53,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 19:00:53,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.62 | 
bwd_microstep: 1335.51 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 1334.57 | step_microstep: 1.94
[2025-11-06 19:00:53,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.34 | bwd: 1336.35 | bwd_inner: 1.60 | bwd_allreduce: 1334.61 | step: 2.03
88%|████████▊ | 3073/3507 [1:16:07<09:26, 1.31s/it] {'loss': 1.1852, 'learning_rate': 7.928874380421059e-07, 'epoch': 0.88}
tensor([[-5.6875, -3.7031, -0.3984, -0.4746, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:00:53,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.05 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.9688, -4.6250, -0.6055, 0.9648, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.8438, -1.5547, 2.2188, -0.4746, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.2188, -4.9375, -0.2148, 1.8984, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-6.8125, -6.2812, -1.6641, 1.7031, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.3438, -5.2188, -0.9180, 3.1719, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.0000, -3.5469, 0.5742, -0.2061, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.3125, -0.9297, 2.3438, -0.3906, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 19:00:55,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.15 | optimizer_step: 0.14
[2025-11-06 19:00:55,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.13 | bwd_microstep: 1282.98 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 1281.66 | step_microstep: 1.53
[2025-11-06 19:00:55,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.22 | bwd: 1283.98 | bwd_inner: 2.11 | bwd_allreduce: 1281.71 | step: 1.62
88%|████████▊ | 3074/3507 [1:16:09<10:10, 1.41s/it] {'loss': 0.206, 'learning_rate': 7.892865759624569e-07, 'epoch': 0.88}
tensor([[-4.1562, -4.3438, -0.7656, 3.5156, -1.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:00:55,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.89 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.9062, -4.6250, -0.2734, 1.7109, -3.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.5156, -2.7188, 1.1797, 3.8281, -1.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.2812, -4.2500, 1.6094, 2.8125, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.8438, -1.3594, 2.0938, -1.1406, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.1094, 0.4746, 2.3594, -1.0938, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-7.2812, -5.8125, -0.3301, 1.7266, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-7.3125, -5.8125, 0.2988, 2.6875, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 19:00:55,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 19:00:55,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.35 | bwd_microstep: 236.62 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 235.64 | step_microstep: 1.67
[2025-11-06 19:00:55,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 289.26 | bwd: 237.55 | bwd_inner: 1.74 | bwd_allreduce: 235.68 | step: 1.75
88%|████████▊ | 3075/3507 [1:16:09<08:18, 1.15s/it] {'loss': 0.1192, 'learning_rate': 7.856935731077808e-07, 'epoch': 0.88}
tensor([[-5.0938, -3.2969, 1.1953, 2.0156, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.1875, -3.0938, 0.7188, 0.6758, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:00:56,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.45 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-6.5312, -2.8281, 3.0469, 0.0530, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.8438, -1.4609, 3.3281, -1.4062, -5.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-0.9141, 2.5000, 2.5625, -1.9062, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.7969, 1.3828, 3.8750, -0.8242, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.8438, -0.2422, 3.5000, -1.8750, -5.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.0312, -3.3281, 0.5859, 1.2188, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 19:00:59,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.34 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 19:00:59,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.56 | bwd_microstep: 2847.40 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 2846.36 | step_microstep: 3.83
[2025-11-06 19:00:59,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.03 | bwd: 2848.28 | bwd_inner: 1.75 | bwd_allreduce: 2846.40 | step: 3.92
88%|████████▊ | 3076/3507 [1:16:12<12:51, 1.79s/it] {'loss': 0.4057, 'learning_rate': 7.821084325438788e-07, 'epoch': 0.88}
tensor([[-1.8984, 2.4531, 3.6406, -2.5000, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-7.8125, -7.3750, -3.2969, 0.9180, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-8.3125, -7.2500, -3.5469, -1.7656, -5.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:00:59,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.36 | bwd_microstep: 1.15 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.1875, -1.6016, 2.2188, 1.0000, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-0.3438, 3.5938, 3.3750, -2.2969, -1.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
tensor([[-3.5156, -0.3359, 1.2500, -1.8750, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-1.6406, -2.4531, -1.5781, 2.2812, 0.4746]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-6.6250, -5.4375, -0.9492, 1.3984, -3.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 19:00:59,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.16 | optimizer_step: 0.19
[2025-11-06 19:00:59,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.84 | bwd_microstep: 41.62 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 40.30 | step_microstep: 2.17
[2025-11-06 19:00:59,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 423.23 | bwd: 42.76 | bwd_inner: 2.29 | bwd_allreduce: 40.34 | step: 2.26
88%|████████▊ | 3077/3507 [1:16:13<10:05, 1.41s/it] {'loss': 0.2351, 'learning_rate': 7.785311573298459e-07, 'epoch': 0.88}
tensor([[-3.6406, 0.6133, 3.4531, -1.3984, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.0938, 0.3027, 4.2500, -1.0859, -4.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-1.4219, 0.6914, 3.6094, 3.2812, -0.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:00:59,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.47 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[1.2656, 2.0938, 5.0000, 7.0000, 2.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.9375, -4.4688, 1.3281, 1.1250, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.9688, -3.8281, 0.3438, 2.3906, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.8750, -2.1094, 2.1094, 0.4688, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.5938, -1.3828, 2.8438, 0.5898, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 19:01:00,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.26 | optimizer_step: 0.33
[2025-11-06 19:01:00,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.99 | bwd_microstep: 816.11 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 814.83 | step_microstep: 5.87
[2025-11-06 19:01:00,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.51 | bwd: 817.00 | bwd_inner: 1.93 | bwd_allreduce: 814.89 | step: 5.96
88%|████████▊ | 3078/3507 [1:16:14<09:45, 1.36s/it] {'loss': 0.2316, 'learning_rate': 7.749617505180596e-07, 'epoch': 0.88}
tensor([[-3.9531, -4.4688, -1.6484, 2.9219, -1.0703]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.4844, -4.5625, -2.0469, 3.3750, -0.4609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:01:01,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.80 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.4844, -0.8242, 2.5469, 1.1484, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.0000, -3.5469, 0.0923, 3.0781, -1.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.0000, -3.3906, 0.8672, 1.8281, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.1562, -3.5156, 1.1406, 2.3125, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.6562, -4.5625, -1.9766, 3.1719, -0.6484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.9062, -2.2969, 2.6250, -0.3887, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:01:01,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 19:01:01,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.72 | bwd_microstep: 1.69 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.71 | step_microstep: 2.08
[2025-11-06 19:01:01,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 456.54 | bwd: 2.71 | bwd_inner: 1.80 | bwd_allreduce: 0.75 | step: 2.17
88%|████████▊ | 3079/3507 [1:16:15<07:53, 1.11s/it] {'loss': 0.2792, 'learning_rate': 7.714002151541911e-07, 'epoch': 0.88}
tensor([[-4.0938, 0.3398, 2.6094, -2.8281, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-7.0625, -5.2500, 0.4609, 2.1094, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.9062, -3.5312, -0.7422, 1.9766, -1.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-7.0000, -3.7344, 1.7734, 0.1260, -5.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:01:01,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.04 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.6875, 0.3711, 4.3438, -2.1406, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.3438, -5.0625, -2.2969, 2.5781, -1.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-7.5000, -6.4375, -2.0156, 0.4277, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.5625, -0.3418, 1.7734, -0.9922, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
[2025-11-06 19:01:03,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.83 | optimizer_gradients: 0.17 | optimizer_step: 0.21
[2025-11-06 19:01:03,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.23 | bwd_microstep: 1475.25 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 1474.22 | step_microstep: 2.57
[2025-11-06 19:01:03,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.30 | bwd: 1476.10 | bwd_inner: 1.71 | bwd_allreduce: 1474.26 | step: 2.66
88%|████████▊ | 3080/3507 [1:16:17<09:28, 1.33s/it] {'loss': 0.5803, 'learning_rate': 7.678465542771929e-07, 'epoch': 0.88}
tensor([[-5.1875, -4.8438, -1.4219, 1.9609, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.7812, -3.9375, 0.8008, 2.0312, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:01:03,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.97 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-2.1719, 1.4375, 2.8281, -1.1328, -2.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.5312, -5.5312, -0.5664, 2.3438, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.1250, -4.0625, -0.6094, 3.0000, -1.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.9688, -2.6094, 1.8125, -0.9805, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.0000, -3.7656, 0.4062, 2.2188, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.0000, -0.5234, 3.3125, 0.5156, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 19:01:03,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 19:01:03,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.05 | bwd_microstep: 67.31 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 66.46 | step_microstep: 2.59
[2025-11-06 19:01:03,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.05 | bwd: 68.22 | bwd_inner: 1.57 | bwd_allreduce: 66.51 | step: 2.67
88%|████████▊ | 3081/3507 [1:16:17<07:34, 1.07s/it] {'loss': 0.4693, 'learning_rate': 7.643007709192918e-07, 'epoch': 0.88}
tensor([[-5.1875, -3.0469, 1.4922, 1.5781, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.3438, -4.8750, -2.7500, 1.1953, -1.6016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-8.3125, -6.5000, -1.2734, 0.2539, -5.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.1875, -5.1250, -1.0547, 3.0938, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.5938, -4.6875, 1.3203, 2.6719, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.2812, -4.4688, -0.9492, 1.4609, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:01:04,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.14 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-7.8125, -5.0000, -0.4980, -1.5312, -6.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.6875, -4.3750, -0.2578, 3.4219, -1.9453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 19:01:06,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.27 | optimizer_step: 0.24
[2025-11-06 19:01:06,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.44 | bwd_microstep: 1006.77 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 1005.78 | step_microstep: 3.10
[2025-11-06 19:01:06,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.61 | bwd: 1007.75 | bwd_inner: 1.74 | bwd_allreduce: 1005.84 | step: 3.21
88%|████████▊ | 3082/3507 [1:16:19<10:12, 1.44s/it] {'loss': 0.2162, 'learning_rate': 7.607628681059998e-07, 'epoch': 0.88}
tensor([[-5.0000, -3.0469, 0.7461, 0.7266, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.9531, 0.8477, 1.6484, -2.6094, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
tensor([[-6.0312, -4.4688, 1.0000, 2.6719, -3.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:01:06,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.30 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.1250, -4.5625, -1.7891, 2.4844, -1.3359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.6406, -2.9688, 0.0400, 2.3750, -1.6641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.2188, 1.7969, 3.2031, -1.6172, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.8750, -0.8906, 3.3750, -1.0391, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.5000, -3.4219, 1.3281, 1.5547, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 19:01:06,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.30 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 19:01:06,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.28 | bwd_microstep: 83.34 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 82.44 | step_microstep: 12.15
[2025-11-06 19:01:06,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 541.60 | bwd: 84.17 | bwd_inner: 1.55 | bwd_allreduce: 82.47 | step: 12.23
88%|████████▊ | 3083/3507 [1:16:20<08:35, 1.22s/it] {'loss': 0.6214, 'learning_rate': 7.572328488561064e-07, 'epoch': 0.88}
tensor([[-4.5312, -3.4531, -0.0830, 1.5859, -2.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.5938, -5.2188, -1.4766, -0.1123, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-9.0625, -6.1875, 0.2480, -0.1611, -6.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-7.2812, -5.4062, 0.3281, 1.6406, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:01:07,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.29 | bwd_microstep: 3.79 | bwd_inner_microstep: 3.67 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-8.3125, -5.2500, 0.7109, -0.6250, -6.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.9219, -3.2969, -1.1562, 2.6406, -0.5508]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.1094, -1.5469, 1.7266, 2.2969, -1.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.3125, -1.1094, 2.3281, 0.1084, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 19:01:08,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.21 | optimizer_step: 0.20
[2025-11-06 19:01:08,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.10 | bwd_microstep: 1057.91 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 1057.01 | step_microstep: 2.70
[2025-11-06 19:01:08,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.41 | bwd: 1061.70 | bwd_inner: 4.47 | bwd_allreduce: 1057.08 | step: 2.80
88%|████████▊ | 3084/3507 [1:16:22<10:37, 1.51s/it] {'loss': 0.4785, 'learning_rate': 7.537107161816681e-07, 'epoch': 0.88}
tensor([[-3.4844, -4.4062, -2.5469, 2.0625, -0.7266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-7.4375, -6.2500, -1.0234, 1.6797, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.9688, -4.9375, -2.0938, 1.2969, -2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:01:09,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.42 | bwd_microstep: 1.37 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-0.8633, -1.4531, -1.6641, 0.8711, 0.6641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:3')
tensor([[-4.8750, -2.6406, 0.8281, 0.2441, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.6250, -3.2188, 1.9297, 1.9141, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.5938, -5.1562, -0.9141, 2.5000, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.5312, -4.4062, 1.3047, 1.9766, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 19:01:11,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.21 | optimizer_step: 0.31
[2025-11-06 19:01:11,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.50 | bwd_microstep: 2074.09 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 2072.79 | step_microstep: 2.93
[2025-11-06 19:01:11,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.98 | bwd: 2075.43 | bwd_inner: 2.36 | bwd_allreduce: 2072.85 | step: 3.02
88%|████████▊ | 3085/3507 [1:16:25<12:32, 1.78s/it] {'loss': 0.3521, 'learning_rate': 7.501964730880151e-07, 'epoch': 0.88}
tensor([[-2.6250, -3.3125, -2.3125, 1.1953, -0.4258]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.2656, -3.0625, -1.9297, 1.9297, 0.0051]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-1.5078, 0.7891, 2.8750, 1.2891, -1.2891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.3906, 0.1250, 2.4062, -1.5078, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[ 0.0967, 3.2344, 4.7812, 1.1797, -0.7773]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.5625, -0.7266, 2.7656, -1.4297, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
[2025-11-06 19:01:12,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.60 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.5000, -3.4531, 2.5625, 1.1250, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.2500, -2.5938, 1.2812, 4.2500, -1.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 19:01:13,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 19:01:13,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.49 | bwd_microstep: 798.38 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 797.13 | step_microstep: 1.99
[2025-11-06 19:01:13,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 482.09 | bwd: 799.37 | bwd_inner: 2.03 | bwd_allreduce: 797.18 | step: 2.08
88%|████████▊ | 3086/3507 [1:16:27<13:00, 1.85s/it] {'loss': 0.7873, 'learning_rate': 7.466901225737455e-07, 'epoch': 0.88}
tensor([[-6.0312, -4.5625, 0.1553, 1.7109, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.7812, -4.4688, -0.4883, 3.0156, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:01:13,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.09 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.0625, -4.4375, 0.2852, 1.3984, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.5938, -0.3105, 3.5625, -1.2656, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.9375, -4.6562, 0.9023, 1.1406, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.0938, -3.5000, 1.1719, 0.4414, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.0938, -3.6875, -1.0078, 3.5938, -0.4238]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.0625, -2.2812, 1.8125, 0.0986, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 19:01:13,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 19:01:13,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.20 | bwd_microstep: 45.40 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 44.46 | step_microstep: 1.50
[2025-11-06 19:01:13,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 346.31 | bwd: 46.40 | bwd_inner: 1.78 | bwd_allreduce: 44.50 | step: 1.59
88%|████████▊ | 3087/3507 [1:16:27<09:58, 1.43s/it] {'loss': 0.6029, 'learning_rate': 7.431916676307238e-07, 'epoch': 0.88}
tensor([[ 0.0693, 4.3750, 5.8125, 0.0698, -1.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-1.3203, 1.6953, 2.2656, -1.2188, -1.9453]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.5938, -3.2969, 1.2344, -1.1562, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.1562, -3.6875, 0.6797, 0.0898, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:01:14,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.95 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-1.9062, 2.4844, 4.1250, -1.8750, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.3438, -1.6953, 3.4219, 0.4277, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-0.7500, 1.8203, 4.0000, 2.3594, -0.7305]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.0312, -4.4375, -0.4043, 2.6875, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 19:01:15,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.26 | optimizer_step: 0.22
[2025-11-06 19:01:15,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.48 | bwd_microstep: 1458.35 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1457.24 | step_microstep: 2.39
[2025-11-06 19:01:15,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.45 | bwd: 1459.36 | bwd_inner: 1.91 | bwd_allreduce: 1457.28 | step: 2.48
88%|████████▊ | 3088/3507 [1:16:29<11:22, 1.63s/it] {'loss': 0.3034, 'learning_rate': 7.397011112440744e-07, 'epoch': 0.88}
tensor([[-5.5938e+00, -4.6875e+00, 3.9368e-03, 2.8125e+00, -2.9375e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-3.7031, -3.0156, 0.0635, 2.2500, -1.7891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.9688, -4.3125, -0.0064, 1.0312, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.1875, -4.5938, 0.8203, 2.7188, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:01:16,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.74 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.5625e+00, -3.3281e+00, 2.1719e+00, 5.4321e-03, -5.3750e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.3750, -3.6094, 0.7773, 1.5703, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.8594, -0.2988, 2.6094, -0.7461, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-8.3750, -6.6562, -1.9688, -0.8203, -5.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 19:01:17,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.20 | optimizer_step: 0.19
[2025-11-06 19:01:17,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.25 | bwd_microstep: 1209.24 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 1207.93 | step_microstep: 2.09
[2025-11-06 19:01:17,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.02 | bwd: 1210.28 | bwd_inner: 2.15 | bwd_allreduce: 1207.98 | step: 2.18
88%|████████▊ | 3089/3507 [1:16:31<11:22, 1.63s/it] {'loss': 0.5407, 'learning_rate': 7.36218456392187e-07, 'epoch': 0.88}
tensor([[-2.9688, 1.5625, 3.2031, -2.6562, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-6.7188, -5.9375, -1.5469, 1.4141, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:01:17,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.90 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.9062, -4.2188, 1.7969, 1.5312, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.1875, -2.1562, 1.1406, 0.8594, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.5000, -3.1406, 0.7500, 1.6641, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.7812, -5.0312, -0.3945, 2.5781, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-7.5000, -6.7500, -2.2969, 0.4941, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-9.0625, -6.1875, 0.6211, 0.5039, -6.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 19:01:19,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.74 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 19:01:19,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.70 | bwd_microstep: 1723.47 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 1722.58 | step_microstep: 2.61
[2025-11-06 19:01:19,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.63 | bwd: 1724.35 | bwd_inner: 1.58 | bwd_allreduce: 1722.63 | step: 2.69
88%|████████▊ | 3090/3507 [1:16:33<12:27, 1.79s/it] {'loss': 0.633, 'learning_rate': 7.327437060467047e-07, 'epoch': 0.88}
tensor([[-3.4531, -3.9844, -1.6719, 2.4844, -0.8242]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.1562, -2.7188, 1.5000, 0.8633, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:01:19,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.83 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.2812, -1.5469, 2.1562, 0.6562, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.1875, -3.3438, 0.4023, 0.3672, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.3438, -4.7188, 0.5273, 2.2188, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.7812, -3.1875, 0.2852, 3.0469, -1.6016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-7.6250, -6.5312, -1.0625, 1.8594, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.9375, -0.9805, 3.3125, -0.9219, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
[2025-11-06 19:01:20,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 19:01:20,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.15 | bwd_microstep: 137.61 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 136.45 | step_microstep: 1.51
[2025-11-06 19:01:20,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 297.01 | bwd: 138.65 | bwd_inner: 2.02 | bwd_allreduce: 136.48 | step: 1.60
88%|████████▊ | 3091/3507 [1:16:33<09:40, 1.39s/it] {'loss': 0.2982, 'learning_rate': 7.292768631725266e-07, 'epoch': 0.88}
tensor([[-5.5000, -4.7500, -0.3008, 2.7344, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.5938, -6.0312, -1.5234, 1.6562, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.8438, -3.7031, 2.0000, 0.3613, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.9375, -1.6328, 2.7188, 0.1230, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.8438, -3.3281, 1.4375, -1.0781, -5.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:01:20,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 290.37 | bwd_microstep: 1.17 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-5.6250, -6.0938, -2.1406, 3.0781, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.4375, -1.6094, 2.3438, 1.0859, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-9.5000, -8.1250, -2.3125, 0.4492, -6.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 19:01:23,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.21 | optimizer_step: 0.27
[2025-11-06 19:01:23,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.03 | bwd_microstep: 2451.08 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 2449.96 | step_microstep: 2.43
[2025-11-06 19:01:23,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 449.42 | bwd: 2452.25 | bwd_inner: 2.09 | bwd_allreduce: 2450.03 | step: 2.54
88%|████████▊ | 3092/3507 [1:16:36<12:51, 1.86s/it] {'loss': 0.2559, 'learning_rate': 7.258179307278068e-07, 'epoch': 0.88}
tensor([[-5.5312, -5.3750, -1.6406, 2.1719, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.6719, -3.4844, -1.0234, 1.9922, -1.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:01:23,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.72 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-7.7500, -5.2812, 0.9609, 1.4219, -5.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-2.6406, 0.9688, 2.3281, -1.7031, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.5938, -5.0000, 0.7930, 2.6719, -4.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.2188, -4.4688, 0.2031, 1.3594, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-1.2969, 1.2422, 2.0469, -0.1924, -1.4766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.0000, -5.2500, -0.2363, 2.9844, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 19:01:23,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 19:01:23,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.29 | bwd_microstep: 110.58 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 109.75 | step_microstep: 1.89
[2025-11-06 19:01:23,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.04 | bwd: 111.34 | bwd_inner: 1.41 | bwd_allreduce: 109.79 | step: 1.96
88%|████████▊ | 3093/3507 [1:16:37<10:08, 1.47s/it] {'loss': 0.3704, 'learning_rate': 7.223669116639487e-07, 'epoch': 0.88}
tensor([[-3.9531, -2.6562, 0.3438, 1.6484, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:01:23,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.45 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.9844, -0.4668, 3.0000, 2.0625, -2.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.9688, -4.0312, 0.5195, 1.2188, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.4062, 2.0000, 3.0625, -2.6875, -3.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.8438, -4.5312, -1.3203, 1.8047, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.8438, -3.8125, 0.9414, 1.3672, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.7500, -3.9531, 0.0864, 0.8633, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-7.2812, -3.9219, 2.0156, 0.1260, -5.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 19:01:24,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 19:01:24,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 233.27 | bwd_microstep: 38.49 | bwd_inner_microstep: 5.77 | bwd_allreduce_microstep: 32.62 | step_microstep: 2.05
[2025-11-06 19:01:24,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.75 | bwd: 39.18 | bwd_inner: 6.37 | bwd_allreduce: 32.66 | step: 2.13
88%|████████▊ | 3094/3507 [1:16:37<08:04, 1.17s/it] {'loss': 0.7993, 'learning_rate': 7.189238089256034e-07, 'epoch': 0.88}
1.17s/it]tensor([[-6.3438, -3.6406, 1.5391, 0.7188, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7188, -0.6836, 2.6094, 0.0728, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4375, -1.6641, 2.0938, 0.7695, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0625, 0.6328, 3.9375, -1.9141, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:01:24,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.60 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-5.5312, -4.5938, -0.4473, 2.0312, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7188, -5.4688, -1.3750, 2.2344, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2500, -2.8594, 1.0859, 2.6562, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6250, -1.1484, 3.0156, 0.1245, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:01:26,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 19:01:26,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.00 | bwd_microstep: 1511.62 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1510.46 | step_microstep: 2.18 [2025-11-06 19:01:26,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 507.63 | bwd: 1512.50 | bwd_inner: 1.84 | bwd_allreduce: 1510.50 | step: 2.27 88%|████████▊ | 3095/3507 [1:16:40<09:54, 1.44s/it] {'loss': 0.3126, 'learning_rate': 7.154886254506632e-07, 'epoch': 0.88} 88%|████████▊ | 3095/3507 [1:16:40<09:54, 1.44s/it]tensor([[-3.1250, 
0.8203, 3.2656, -1.1641, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7188, -2.7188, 1.4844, -0.2695, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:26,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.16 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.5625, -3.9062, 0.4980, 1.3203, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1250, -3.6875, 1.0312, 2.7969, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1875e+00, -7.8906e-01, 2.7500e+00, 4.1199e-03, -3.7812e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.4844, 1.6797, 2.0469, -3.7031, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2500, -3.0938, 1.6094, -0.4570, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4062, -3.7344, 0.0137, 2.7188, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:27,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.85 | optimizer_gradients: 0.17 | optimizer_step: 0.21 [2025-11-06 19:01:27,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.10 | bwd_microstep: 2.10 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.91 [2025-11-06 19:01:27,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.30 | bwd: 3.08 | bwd_inner: 2.07 | bwd_allreduce: 0.88 | step: 2.98 88%|████████▊ | 3096/3507 [1:16:40<08:46, 1.28s/it] {'loss': 0.4239, 'learning_rate': 7.120613641702723e-07, 'epoch': 0.88} 88%|████████▊ | 3096/3507 [1:16:40<08:46, 1.28s/it]tensor([[-3.1094, 0.2910, 2.5625, 
-0.7773, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5312, -3.3906, 1.1484, 1.2109, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -3.6250, -0.0505, 1.7891, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:27,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.64 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.5625, -5.6250, -0.5156, 0.3828, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5938, 0.6133, 3.0938, -1.8516, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.0078, 3.0312, 3.2188, -2.4219, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.3438, -1.8672, 3.5156, -1.1641, -6.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -3.1406, 0.4863, 1.4609, -2.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:01:29,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.20 | optimizer_step: 0.21 [2025-11-06 19:01:29,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.10 | bwd_microstep: 2354.82 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 2353.48 | step_microstep: 5.72 [2025-11-06 19:01:29,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.76 | bwd: 2355.74 | bwd_inner: 2.06 | bwd_allreduce: 2353.54 | step: 5.80 88%|████████▊ | 3097/3507 [1:16:43<11:46, 1.72s/it] {'loss': 0.3201, 'learning_rate': 7.086420280088091e-07, 'epoch': 0.88} 88%|████████▊ | 3097/3507 [1:16:43<11:46, 1.72s/it]tensor([[-2.1719, 2.3906, 3.4531, -2.6094, -3.4219]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:01:30,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 101.76 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.2812, -4.7188, 1.0469, 1.1328, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2344, -4.1250, -2.4688, 2.4219, -0.4160]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.0938, -5.8750, -0.4824, 2.2656, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.4062, -5.1562, 0.2451, 2.6250, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6562, -3.1562, -0.6406, -4.0625, -6.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.5625, 0.6289, 2.1250, -3.1875, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7500, -0.0776, 3.9375, 0.6523, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:01:31,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.16 | optimizer_step: 0.22 [2025-11-06 19:01:31,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.54 | bwd_microstep: 3.30 | bwd_inner_microstep: 2.18 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.52 [2025-11-06 19:01:31,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.31 | bwd: 4.08 | bwd_inner: 2.86 | bwd_allreduce: 1.07 | step: 7.60 88%|████████▊ | 3098/3507 [1:16:44<10:37, 1.56s/it] {'loss': 0.5193, 'learning_rate': 7.052306198838854e-07, 'epoch': 0.88} 88%|████████▊ | 3098/3507 [1:16:44<10:37, 1.56s/it]tensor([[-3.8750, -2.6562, 0.2412, 1.2422, -2.3594]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0312, -2.9844, 1.2500, 1.3203, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0938, -5.7812, -2.2500, 1.1641, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7188, -2.8594, 2.4062, 1.5312, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:31,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.04 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.3125, -4.3438, -1.0547, 2.8125, -1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4688, -2.1406, 1.3047, 0.8555, -3.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3438, -3.2188, 0.5234, 4.3438, -0.8477]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9375, -4.0000, -1.1328, 2.5156, -1.4297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:31,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 19:01:31,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.38 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.83 | step_microstep: 1.96 [2025-11-06 19:01:31,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.43 | bwd: 2.74 | bwd_inner: 1.71 | bwd_allreduce: 0.87 | step: 2.04 88%|████████▊ | 3099/3507 [1:16:45<08:18, 1.22s/it] {'loss': 0.4905, 'learning_rate': 7.018271427063583e-07, 'epoch': 0.88} 88%|████████▊ | 3099/3507 [1:16:45<08:18, 1.22s/it]tensor([[-4.0312, -2.6094, 0.6797, 1.6797, -2.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:3') tensor([[-3.3750, 0.0669, 2.2031, -1.2188, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-3.3438, -4.0312, -1.5703, 3.0938, -0.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.7188, -4.2188, 1.7109, 1.5547, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7812, -0.7773, 2.0156, -0.1758, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0000, -2.0625, 1.8906, 0.0254, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.2500, -5.1875, -0.0476, 2.5156, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:32,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.56 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.7500, -3.7031, 0.8711, 1.1094, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:34,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 19:01:34,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.30 | bwd_microstep: 2.21 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 0.88 | step_microstep: 2.66 [2025-11-06 19:01:34,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 414.88 | bwd: 3.11 | bwd_inner: 2.06 | bwd_allreduce: 0.92 | step: 2.74 88%|████████▊ | 3100/3507 [1:16:48<12:24, 1.83s/it] {'loss': 1.0417, 'learning_rate': 6.984315993803104e-07, 'epoch': 0.88} 88%|████████▊ | 3100/3507 [1:16:48<12:24, 1.83s/it]tensor([[-4.6250, -4.9062, -1.7656, 2.5938, -1.6797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') 
[2025-11-06 19:01:34,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.75 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.5625, -3.5000, 0.0718, 0.0288, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3750, -2.1875, 3.1094, 1.3672, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.6562, 1.1562, 1.9219, -2.7812, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.0312, -1.9688, 1.6719, 1.8047, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.2188, -4.4375, -0.9062, -0.6445, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5625, -4.5000, 0.4062, 3.0938, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.0625, 0.0036, 3.5156, 3.5625, -1.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:01:35,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 19:01:35,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.40 | bwd_microstep: 282.19 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 281.10 | step_microstep: 2.11 [2025-11-06 19:01:35,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.18 | bwd: 282.97 | bwd_inner: 1.72 | bwd_allreduce: 281.13 | step: 2.19 88%|████████▊ | 3101/3507 [1:16:49<10:02, 1.48s/it] {'loss': 0.546, 'learning_rate': 6.950439928030583e-07, 'epoch': 0.88} 88%|████████▊ | 3101/3507 [1:16:49<10:02, 1.48s/it]tensor([[-6.7812, -3.4688, 2.0938, 0.1270, -5.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7500, -1.6250, 
2.0938, -0.3457, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7812, -2.8125, 2.3750, 1.2891, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8750, -2.7969, 0.2090, 3.7188, -0.6211]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, 0.1318, 3.8438, -1.8906, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.5938, -4.7188, 1.4922, 0.5586, -5.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8906, -3.4062, 0.2471, 3.3125, -1.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:35,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.66 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.2656, -2.4062, 0.6562, 4.7812, 0.0679]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:37,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.22 | optimizer_step: 0.18 [2025-11-06 19:01:37,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.99 | bwd_microstep: 1.82 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.81 | step_microstep: 191.04 [2025-11-06 19:01:37,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.65 | bwd: 2.79 | bwd_inner: 1.78 | bwd_allreduce: 0.86 | step: 191.14 88%|████████▊ | 3102/3507 [1:16:51<12:12, 1.81s/it] {'loss': 0.1284, 'learning_rate': 6.916643258651434e-07, 'epoch': 0.88} 88%|████████▊ | 3102/3507 [1:16:51<12:12, 1.81s/it]tensor([[-9.6250, -6.7188, -0.4453, -0.8945, -7.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2031, 0.8047, 2.5625, -2.6406, -3.9219]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.6875, -6.8125, -2.6094, 2.1406, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9062, -1.3438, 1.8125, -1.8516, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:01:38,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.81 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-4.5000, -2.7812, 0.6484, 1.3438, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.0625, 0.8281, 2.0625, -0.8750, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.8594, -3.7969, -2.7188, 1.4688, -0.3613]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8438, -4.8438, -1.4062, 2.4062, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:38,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 19:01:38,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.21 | bwd_microstep: 1.69 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 1.99 [2025-11-06 19:01:38,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 394.05 | bwd: 2.61 | bwd_inner: 1.66 | bwd_allreduce: 0.80 | step: 2.09 88%|████████▊ | 3103/3507 [1:16:52<09:24, 1.40s/it] {'loss': 0.3676, 'learning_rate': 6.882926014503344e-07, 'epoch': 0.88} 88%|████████▊ | 3103/3507 [1:16:52<09:24, 1.40s/it]tensor([[-4.3750, -0.2295, 3.6250, -0.9258, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2812, -5.1875, -0.2578, 2.2656, -3.5625]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3750, -4.1562, 0.3301, 2.3906, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.8438, -5.1250, -0.4199, 1.0234, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2812, -1.9375, 1.2891, 0.8242, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.9375, -5.3125, -1.1641, 1.8359, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -3.3438, 0.5586, 2.2656, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:38,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.76 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.8516, 0.3242, 3.8125, 3.6406, -1.0547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:01:40,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 19:01:40,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.88 | bwd_microstep: 2.06 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.52 [2025-11-06 19:01:40,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.65 | bwd: 2.75 | bwd_inner: 1.72 | bwd_allreduce: 0.90 | step: 2.60 89%|████████▊ | 3104/3507 [1:16:54<11:25, 1.70s/it] {'loss': 0.4688, 'learning_rate': 6.849288224356221e-07, 'epoch': 0.89} 89%|████████▊ | 3104/3507 [1:16:54<11:25, 1.70s/it]tensor([[-4.6250, -1.3281, 1.9219, -0.9492, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0625, -3.3906, -0.1187, 2.2344, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:0') [2025-11-06 19:01:41,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.11 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.0938, -2.8281, 0.8828, 2.2188, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7188, -1.5469, 3.0156, -1.4375, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5938, -4.3750, 0.3301, 2.6094, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-8.2500, -5.1562, 1.1094, 0.0610, -6.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.0938, -4.9688, 1.4219, 2.2344, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4844, -0.8438, 1.1250, -1.0781, -3.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:01:43,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 19:01:43,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.71 | bwd_microstep: 2576.42 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 2575.18 | step_microstep: 1.98 [2025-11-06 19:01:43,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.85 | bwd: 2577.34 | bwd_inner: 1.99 | bwd_allreduce: 2575.22 | step: 2.06 89%|████████▊ | 3105/3507 [1:16:57<13:57, 2.08s/it] {'loss': 0.1862, 'learning_rate': 6.815729916912184e-07, 'epoch': 0.89} 89%|████████▊ | 3105/3507 [1:16:57<13:57, 2.08s/it]tensor([[-3.9375, -3.7188, -0.1748, 3.4375, -1.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5000, -3.4531, 0.6133, 0.9609, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:0') [2025-11-06 19:01:43,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.90 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.4062, -4.3438, -0.1846, 1.9531, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1250, -1.7188, 2.9844, 0.3242, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0312, -2.4844, 1.6797, 0.6641, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2812, -3.2656, 0.6602, 2.4062, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.8750, -3.2656, 2.8906, 0.4668, -5.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7969, -3.0312, -0.4668, 1.3750, -1.9453]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:01:44,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 19:01:44,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.77 | bwd_microstep: 198.39 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 197.50 | step_microstep: 2.09 [2025-11-06 19:01:44,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.70 | bwd: 199.10 | bwd_inner: 1.41 | bwd_allreduce: 197.54 | step: 2.18 89%|████████▊ | 3106/3507 [1:16:58<10:54, 1.63s/it] {'loss': 0.8067, 'learning_rate': 6.782251120805528e-07, 'epoch': 0.89} 89%|████████▊ | 3106/3507 [1:16:58<10:54, 1.63s/it]tensor([[-4.2500, -1.4922, 2.7969, 1.3906, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -0.6094, 3.8594, 0.7227, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') 
tensor([[-5.0625, -4.6875, -0.5312, 3.2344, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9062, -3.5781, 1.3359, 3.3125, -2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:44,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 281.37 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-5.3750, -4.2812, 0.4062, 2.7344, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -1.7422, 2.1406, 0.2520, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.1562, -0.4199, 3.0781, -0.5938, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5312, 0.5078, 2.7500, -1.9062, -3.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:01:45,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.01 | optimizer_gradients: 0.19 | optimizer_step: 0.21 [2025-11-06 19:01:45,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.53 | bwd_microstep: 705.92 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 705.00 | step_microstep: 5.34 [2025-11-06 19:01:45,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 478.94 | bwd: 706.69 | bwd_inner: 1.49 | bwd_allreduce: 705.05 | step: 5.44 89%|████████▊ | 3107/3507 [1:16:59<10:04, 1.51s/it] {'loss': 0.1092, 'learning_rate': 6.748851864602691e-07, 'epoch': 0.89} 89%|████████▊ | 3107/3507 [1:16:59<10:04, 1.51s/it]tensor([[-4.9375, -4.2812, -0.4102, 2.4688, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[7.1562, 8.0625, 8.1250, 8.3125, 6.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.2188, -5.8750, 
-1.5938, 2.3125, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:01:45,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.42 | bwd_microstep: 7.64 | bwd_inner_microstep: 7.53 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5938, -3.4062, -0.3965, 2.6406, -1.3828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7500, -2.5938, 0.7109, 0.3574, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9531, 1.0391, 2.6406, -2.5156, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9688, -3.4062, 1.5156, 0.9336, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9062, 0.2754, 3.3438, -1.2891, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:01:46,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 19:01:46,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 93.68 | bwd_microstep: 1.66 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.77 | step_microstep: 2.54 [2025-11-06 19:01:46,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 291.13 | bwd: 9.30 | bwd_inner: 8.35 | bwd_allreduce: 0.81 | step: 2.63 89%|████████▊ | 3108/3507 [1:17:00<08:38, 1.30s/it] {'loss': 0.4329, 'learning_rate': 6.715532176802298e-07, 'epoch': 0.89} 89%|████████▊ | 3108/3507 [1:17:00<08:38, 1.30s/it]tensor([[-4.0938, -3.8125, -0.7461, 2.2812, -1.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.5156, 1.6250, 2.2656, -1.1406, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:01:46,613] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 181.22 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.2969, 0.8789, 2.7344, -2.3750, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.4375, -5.5625, 0.4238, 1.9531, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.9062, -5.8750, -2.4062, 1.4609, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5000, -4.9375, -1.1953, 1.6484, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.7734, 1.7891, 2.6719, -1.7891, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[ 0.8008, 4.4062, 3.6875, -1.3203, -0.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') [2025-11-06 19:01:48,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 19:01:48,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.31 | bwd_microstep: 1510.67 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 1509.87 | step_microstep: 2.11 [2025-11-06 19:01:48,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.56 | bwd: 1511.33 | bwd_inner: 1.25 | bwd_allreduce: 1509.93 | step: 2.19 89%|████████▊ | 3109/3507 [1:17:02<09:44, 1.47s/it] {'loss': 0.2216, 'learning_rate': 6.682292085834985e-07, 'epoch': 0.89} 89%|████████▊ | 3109/3507 [1:17:02<09:44, 1.47s/it]tensor([[-5.6562, -4.6250, 0.0280, 2.4844, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5000, -4.3750, -0.5742, 3.3125, -1.7109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5625, -4.0938, 0.7617, 2.3906, -3.3281]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:48,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.52 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.5000, -6.3438, -3.0000, 0.5938, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7656, -3.3125, -0.1758, 3.0156, -1.4141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0625, -3.0625, 1.0938, 1.3516, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6562, -4.0000, -0.1572, 2.5312, -2.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8594, -0.0479, 3.0625, -0.9023, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:01:49,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:01:49,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 127.90 | bwd_microstep: 1.84 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.76 | step_microstep: 1.90 [2025-11-06 19:01:49,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.41 | bwd: 2.74 | bwd_inner: 1.82 | bwd_allreduce: 0.80 | step: 1.98 89%|████████▊ | 3110/3507 [1:17:03<10:00, 1.51s/it] {'loss': 0.1725, 'learning_rate': 6.649131620063554e-07, 'epoch': 0.89} 89%|████████▊ | 3110/3507 [1:17:03<10:00, 1.51s/it]tensor([[ 0.1406, -0.4238, 1.1641, 4.9375, 1.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:50,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.50 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.7500, -1.8125, 
2.0156, 0.6523, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.1875, 2.4688, 4.5938, -1.8516, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[1.2188, 4.1250, 5.5938, 2.5312, 0.3379]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[ 0.2119, 3.0625, 1.7344, -1.8906, -0.8789]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-8.1250, -5.2812, 1.0625, 0.5508, -6.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0000, -3.1719, 0.5352, 0.8906, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5469, 1.3047, 2.9219, -1.5156, -3.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:01:51,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 19:01:51,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.16 | bwd_microstep: 1319.97 | bwd_inner_microstep: 1.54 | bwd_allreduce_microstep: 1318.35 | step_microstep: 2.18 [2025-11-06 19:01:51,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.69 | bwd: 1320.95 | bwd_inner: 2.43 | bwd_allreduce: 1318.39 | step: 2.26 89%|████████▊ | 3111/3507 [1:17:05<10:20, 1.57s/it] {'loss': 0.3414, 'learning_rate': 6.616050807782803e-07, 'epoch': 0.89} 89%|████████▊ | 3111/3507 [1:17:05<10:20, 1.57s/it]tensor([[-5.0625, -2.9844, 1.2422, 1.5000, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.4219, -3.0625, -0.4668, 4.2500, 0.1660]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2500, -3.3906, 0.1069, 2.7188, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') 
tensor([[-6.5938, -3.2344, 2.6094, 0.8789, -5.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.7812, -6.4688, -0.7109, 1.9922, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:51,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.31 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.2500, -4.5938, -0.5234, 0.5195, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7188, -4.1250, 0.8711, 2.5781, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3125, -3.5469, 0.6953, 3.6094, -1.9453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:01:52,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 19:01:52,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.41 | bwd_microstep: 1.72 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 1.76 [2025-11-06 19:01:52,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 491.72 | bwd: 2.53 | bwd_inner: 1.63 | bwd_allreduce: 0.78 | step: 1.85 89%|████████▊ | 3112/3507 [1:17:06<09:50, 1.49s/it] {'loss': 0.3964, 'learning_rate': 6.583049677219633e-07, 'epoch': 0.89} 89%|████████▊ | 3112/3507 [1:17:06<09:50, 1.49s/it]tensor([[-2.2812, -2.8125, -0.6875, 3.2344, -0.0542]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7969, -0.3105, 2.3438, -1.2109, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2188, -1.7656, 3.3281, 0.7812, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:01:53,196] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.30 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.1250, -2.6094, 2.7656, 0.4570, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5000, -4.8438, -0.0757, 3.4062, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3125, -0.7383, 3.2031, -2.2031, -5.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0938, -4.0938, -0.4453, 1.4453, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5312, -4.5625, -0.7383, 3.5312, -1.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:01:54,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.20 | optimizer_step: 0.25 [2025-11-06 19:01:54,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 105.13 | bwd_microstep: 701.08 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 700.25 | step_microstep: 2.18 [2025-11-06 19:01:54,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.45 | bwd: 701.90 | bwd_inner: 1.48 | bwd_allreduce: 700.29 | step: 2.25 89%|████████▉ | 3113/3507 [1:17:07<09:02, 1.38s/it] {'loss': 0.0718, 'learning_rate': 6.550128256532906e-07, 'epoch': 0.89} 89%|████████▉ | 3113/3507 [1:17:07<09:02, 1.38s/it]tensor([[-3.3438, -3.9531, -1.3672, 3.2500, -0.5117]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -3.6875, 0.5391, 1.9219, -2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0000, -0.8242, 3.2812, -1.2031, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5625, -3.3750, 1.1484, 0.9648, -3.9844]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:01:54,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.50 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.2188, -4.0625, -2.3594, 2.0469, -0.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -0.4473, 2.9219, -0.5430, -3.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9531, -0.6328, 2.9375, 2.5469, -1.9766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.6562, 1.5312, 3.2344, -1.9297, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:01:55,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.24 | optimizer_step: 0.24 [2025-11-06 19:01:55,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.60 | bwd_microstep: 439.77 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 438.86 | step_microstep: 2.40 [2025-11-06 19:01:55,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 498.14 | bwd: 440.60 | bwd_inner: 1.51 | bwd_allreduce: 438.92 | step: 2.50 89%|████████▉ | 3114/3507 [1:17:08<08:15, 1.26s/it] {'loss': 0.3956, 'learning_rate': 6.517286573813453e-07, 'epoch': 0.89} 89%|████████▉ | 3114/3507 [1:17:08<08:15, 1.26s/it]tensor([[-4.2500, -2.1406, 1.9375, 2.0156, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.8906, 3.3750, 5.0625, -0.7969, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:01:55,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.92 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 
tensor([[-6.8125, -3.3750, 2.2344, 0.0143, -5.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.1562, 1.6953, 2.7656, -2.0625, -3.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-1.9922, 1.8438, 3.0625, -1.7891, -2.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-7.1875, -5.9688, -0.1543, 2.5156, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9688, -1.7812, 2.1719, 1.7344, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -4.0625, 0.2617, 1.9297, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:01:58,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.25 | optimizer_step: 0.35 [2025-11-06 19:01:58,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.69 | bwd_microstep: 2971.09 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 2969.88 | step_microstep: 2.81 [2025-11-06 19:01:58,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 378.63 | bwd: 2972.03 | bwd_inner: 1.91 | bwd_allreduce: 2969.94 | step: 2.91 89%|████████▉ | 3115/3507 [1:17:12<13:32, 2.07s/it] {'loss': 0.905, 'learning_rate': 6.484524657084134e-07, 'epoch': 0.89} 89%|████████▉ | 3115/3507 [1:17:12<13:32, 2.07s/it]tensor([[-3.1875, -3.1875, 0.0144, 3.8438, -0.6992]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7500, -5.3750, -1.2891, 1.9219, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8438, -2.2812, -0.4180, -2.1562, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.1562, 1.9219, 1.5312, -2.0469, -1.9062]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([1], device='cuda:0') [2025-11-06 19:01:59,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.69 | bwd_microstep: 15.91 | bwd_inner_microstep: 15.76 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-5.3125, -2.6562, 2.1875, 1.4844, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -0.2910, 3.7656, -1.4766, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8438, -1.7031, 1.7344, -0.6836, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.0938, -2.3281, 2.3281, -1.0312, -5.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:01:59,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 19:01:59,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.49 | bwd_microstep: 86.01 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 85.15 | step_microstep: 1.83 [2025-11-06 19:01:59,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.22 | bwd: 101.92 | bwd_inner: 16.54 | bwd_allreduce: 85.21 | step: 1.94 89%|████████▉ | 3116/3507 [1:17:13<10:30, 1.61s/it] {'loss': 0.282, 'learning_rate': 6.45184253429969e-07, 'epoch': 0.89} 89%|████████▉ | 3116/3507 [1:17:13<10:30, 1.61s/it]tensor([[-4.4062, -3.0312, -0.0249, 0.8867, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6406, -0.3281, 2.6562, 1.5859, -1.9922]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:01:59,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.22 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.9375, -4.9688, 0.8516, 1.9141, 
-4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3906, 0.9961, 3.2188, -0.3945, -2.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0625, -2.9375, 0.7031, 2.3594, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.2812, -3.2344, -2.2500, 2.0312, 0.1816]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.2812, -7.8438, -4.5312, 0.4180, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2500, -1.7578, 1.8438, 1.0469, -3.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:02:03,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 19:02:03,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.54 | bwd_microstep: 3796.62 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 3795.68 | step_microstep: 1.95 [2025-11-06 19:02:03,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.83 | bwd: 3797.33 | bwd_inner: 1.44 | bwd_allreduce: 3795.73 | step: 2.04 89%|████████▉ | 3117/3507 [1:17:17<15:32, 2.39s/it] {'loss': 0.5808, 'learning_rate': 6.419240233346801e-07, 'epoch': 0.89} 89%|████████▉ | 3117/3507 [1:17:17<15:32, 2.39s/it]tensor([[-4.4688, -0.3320, 3.3438, -1.3906, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6250, -3.0625, 1.2188, 0.1680, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -0.5117, 3.5625, -1.2734, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:02:03,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.46 | bwd_microstep: 0.86 | 
bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.1875, -3.1406, 1.0703, 1.4531, -3.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5000, -3.2188, 1.8906, -0.2832, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.6875, -4.5000, 0.6875, 3.3750, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5312, -4.7500, 0.6133, 1.7188, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.1250, -3.4062, 1.8750, 1.2109, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:02:04,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:02:04,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.91 | bwd_microstep: 85.88 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 84.60 | step_microstep: 1.68 [2025-11-06 19:02:04,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.40 | bwd: 86.74 | bwd_inner: 1.98 | bwd_allreduce: 84.63 | step: 1.75 89%|████████▉ | 3118/3507 [1:17:17<11:43, 1.81s/it] {'loss': 0.4393, 'learning_rate': 6.386717782044016e-07, 'epoch': 0.89} 89%|████████▉ | 3118/3507 [1:17:17<11:43, 1.81s/it]tensor([[-7.5000, -5.7188, -1.0078, -0.0092, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.3750, -6.9375, -2.1406, 1.7031, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:02:04,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.24 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-2.9062, 1.3672, 4.3125, -1.0234, -3.6250]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2812, -2.6250, 2.7344, -0.0069, -5.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.3594, 0.7812, 3.6250, -1.1484, -3.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0312, -1.4766, 3.3906, -1.7969, -6.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0312, -3.2812, 1.1172, 2.0938, -3.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6875, -3.7344, -0.0515, 2.0469, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:02:05,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.18 | optimizer_step: 0.27 [2025-11-06 19:02:05,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.13 | bwd_microstep: 1266.17 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 1264.86 | step_microstep: 2.00 [2025-11-06 19:02:05,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.40 | bwd: 1267.13 | bwd_inner: 2.07 | bwd_allreduce: 1264.91 | step: 2.09 89%|████████▉ | 3119/3507 [1:17:19<11:25, 1.77s/it] {'loss': 0.2507, 'learning_rate': 6.35427520814178e-07, 'epoch': 0.89} 89%|████████▉ | 3119/3507 [1:17:19<11:25, 1.77s/it]tensor([[-3.6094, 0.5859, 2.8750, -2.0781, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:02:05,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.98 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.7500, -4.0938, -1.1484, 3.0000, -1.0859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2500, 0.0522, 3.1719, 0.2109, -3.1406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2],
device='cuda:3') tensor([[-4.5000, -1.6328, 1.6406, 0.0175, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6562, -1.4219, 3.0938, -1.5078, -5.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.4219, 0.3730, 3.6562, -0.0500, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4062, -4.5938, -1.3203, 2.9062, -1.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-8.9375, -5.5312, -1.1641, -3.6250, -7.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:02:06,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 19:02:06,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.33 | bwd_microstep: 159.65 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 158.35 | step_microstep: 1.86 [2025-11-06 19:02:06,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 262.33 | bwd: 160.70 | bwd_inner: 2.15 | bwd_allreduce: 158.40 | step: 1.96 89%|████████▉ | 3120/3507 [1:17:20<08:50, 1.37s/it] {'loss': 0.0819, 'learning_rate': 6.32191253932235e-07, 'epoch': 0.89} 89%|████████▉ | 3120/3507 [1:17:20<08:50, 1.37s/it]tensor([[-4.3438, -0.2793, 2.3906, -2.0781, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:02:06,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.65 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9688, -5.1562, -1.6172, 2.7656, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4375, -4.8438, 0.1855, 1.8672, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:3') tensor([[-5.3750, -4.9062, -0.7734, 2.7188, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5625, -3.2188, 1.4453, 1.3047, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.8438, 0.8867, 1.6172, -0.8359, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.7812, -4.8750, 0.2188, 0.9609, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6562, -4.8750, -1.5703, 2.8125, -1.7422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:02:09,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 19:02:09,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.55 | bwd_microstep: 3200.16 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 3199.05 | step_microstep: 2.31 [2025-11-06 19:02:09,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 381.23 | bwd: 3201.19 | bwd_inner: 1.97 | bwd_allreduce: 3199.10 | step: 2.40 89%|████████▉ | 3121/3507 [1:17:23<13:09, 2.05s/it] {'loss': 0.231, 'learning_rate': 6.289629803199837e-07, 'epoch': 0.89} 89%|████████▉ | 3121/3507 [1:17:23<13:09, 2.05s/it]tensor([[-2.5938, -3.2188, -1.0234, 3.3125, -0.1074]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8750, -2.0312, 2.0312, 0.4590, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:02:10,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.37 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.06 tensor([[-6.1562, -3.0781, 2.5625, 1.1016, -4.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-2.1875, 0.7617, 3.6094, 1.2266, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.3125, -5.4062, -0.5820, 2.5000, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8906, -1.7500, 1.6328, 1.0078, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1875, -4.7812, -1.2734, 1.5391, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6875, -4.6250, 0.0771, 2.8438, -3.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:02:10,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 19:02:10,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.89 | bwd_microstep: 43.43 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 42.30 | step_microstep: 2.10 [2025-11-06 19:02:10,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.31 | bwd: 44.29 | bwd_inner: 1.80 | bwd_allreduce: 42.34 | step: 2.16 89%|████████▉ | 3122/3507 [1:17:24<10:00, 1.56s/it] {'loss': 0.2356, 'learning_rate': 6.257427027320129e-07, 'epoch': 0.89} 89%|████████▉ | 3122/3507 [1:17:24<10:00, 1.56s/it]tensor([[-3.9688, -2.5156, 1.7109, 3.0469, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:02:10,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.47 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-0.9453, -1.9453, -1.8672, 1.7188, 0.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-5.8438, -1.7500, 3.4219, -0.5859, -5.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7656, 
-0.0796, 3.0938, 1.1250, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -4.1250, -1.5078, 0.8164, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7188, -3.5469, 0.9375, 2.9531, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2031, -1.4688, 1.7422, 1.7812, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1562, -4.9062, -2.9219, 1.4688, -1.3359]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:02:13,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 19:02:13,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.14 | bwd_microstep: 2590.50 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 2589.29 | step_microstep: 1.84 [2025-11-06 19:02:13,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.64 | bwd: 2591.42 | bwd_inner: 1.91 | bwd_allreduce: 2589.34 | step: 1.93 89%|████████▉ | 3123/3507 [1:17:27<12:40, 1.98s/it] {'loss': 0.5866, 'learning_rate': 6.225304239160856e-07, 'epoch': 0.89} 89%|████████▉ | 3123/3507 [1:17:27<12:40, 1.98s/it]tensor([[-3.7500, -0.0981, 2.9375, -0.7031, -3.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:02:13,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.55 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-9.1875, -8.3750, -3.2188, 0.1797, -5.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1562, -3.7969, -1.1562, 1.3047, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5312, -1.0312, 2.4844, 
-0.5859, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5000, -1.1016, 2.6094, 0.1904, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.6016, -1.0703, 0.7891, 2.7656, -0.2070]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.0625, 2.4688, 2.9219, -2.0000, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.8750, -3.4219, 2.5938, 0.8477, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:02:13,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.15 | optimizer_step: 0.20 [2025-11-06 19:02:13,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.30 | bwd_microstep: 69.88 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 68.79 | step_microstep: 1.70 [2025-11-06 19:02:13,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.87 | bwd: 70.87 | bwd_inner: 1.90 | bwd_allreduce: 68.83 | step: 1.78 89%|████████▉ | 3124/3507 [1:17:27<09:37, 1.51s/it] {'loss': 0.6016, 'learning_rate': 6.193261466131484e-07, 'epoch': 0.89} 89%|████████▉ | 3124/3507 [1:17:27<09:37, 1.51s/it]tensor([[-6.0938, -4.1562, 1.4766, 2.5938, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5156, -1.4141, 0.9219, 0.4902, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:02:13,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.80 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2500, -0.5039, 2.8750, -0.9492, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.2812, -5.0938, 1.1250, 1.9297, -4.8750]], 
[debug residue condensed: each training step also printed, per rank (cuda:0–cuda:3), a 1×5 bfloat16 logits tensor and its predicted-label tensor (the grad_fn names were lost in extraction), plus [Rank 0] per-microstep timing breakdowns (optimizer_allgather / optimizer_gradients / optimizer_step and fwd_/bwd_microstep). Only the aggregate per-step timing line and the first copy of each tqdm progress record are kept below; the dump is truncated mid-record at step 3145.]

[2025-11-06 19:02:14,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 465.28 | bwd: 133.21 | bwd_inner: 1.62 | bwd_allreduce: 131.47 | step: 1.75
 89%|████████▉ | 3125/3507 [1:17:28<07:57, 1.25s/it] {'loss': 0.4389, 'learning_rate': 6.161298735573107e-07, 'epoch': 0.89}
[2025-11-06 19:02:14,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.03 | bwd: 112.88 | bwd_inner: 1.81 | bwd_allreduce: 110.94 | step: 1.93
 89%|████████▉ | 3126/3507 [1:17:28<06:25, 1.01s/it] {'loss': 0.9209, 'learning_rate': 6.129416074758565e-07, 'epoch': 0.89}
[2025-11-06 19:02:18,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 261.06 | bwd: 3397.29 | bwd_inner: 2.02 | bwd_allreduce: 3395.14 | step: 2.65
 89%|████████▉ | 3127/3507 [1:17:32<11:30, 1.82s/it] {'loss': 0.3433, 'learning_rate': 6.097613510892364e-07, 'epoch': 0.89}
[2025-11-06 19:02:19,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 270.14 | bwd: 234.90 | bwd_inner: 1.66 | bwd_allreduce: 233.11 | step: 1.95
 89%|████████▉ | 3128/3507 [1:17:32<09:02, 1.43s/it] {'loss': 0.8736, 'learning_rate': 6.065891071110708e-07, 'epoch': 0.89}
[2025-11-06 19:02:20,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.98 | bwd: 1506.22 | bwd_inner: 1.50 | bwd_allreduce: 1504.58 | step: 2.25
 89%|████████▉ | 3129/3507 [1:17:34<09:49, 1.56s/it] {'loss': 0.3451, 'learning_rate': 6.034248782481389e-07, 'epoch': 0.89}
[2025-11-06 19:02:21,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.38 | bwd: 44.41 | bwd_inner: 1.76 | bwd_allreduce: 42.52 | step: 1.80
 89%|████████▉ | 3130/3507 [1:17:35<07:40, 1.22s/it] {'loss': 0.6143, 'learning_rate': 6.002686672003821e-07, 'epoch': 0.89}
[2025-11-06 19:02:24,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.04 | bwd: 2675.20 | bwd_inner: 1.79 | bwd_allreduce: 2673.27 | step: 2.17
 89%|████████▉ | 3131/3507 [1:17:38<11:02, 1.76s/it] {'loss': 0.3584, 'learning_rate': 5.971204766609007e-07, 'epoch': 0.89}
[2025-11-06 19:02:24,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.86 | bwd: 95.68 | bwd_inner: 2.01 | bwd_allreduce: 93.54 | step: 1.63
 89%|████████▉ | 3132/3507 [1:17:38<08:32, 1.37s/it] {'loss': 0.8005, 'learning_rate': 5.939803093159502e-07, 'epoch': 0.89}
[2025-11-06 19:02:25,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 407.15 | bwd: 52.19 | bwd_inner: 1.93 | bwd_allreduce: 50.13 | step: 1.73
 89%|████████▉ | 3133/3507 [1:17:39<06:53, 1.11s/it] {'loss': 0.4243, 'learning_rate': 5.908481678449407e-07, 'epoch': 0.89}
[2025-11-06 19:02:25,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.02 | bwd: 237.39 | bwd_inner: 1.80 | bwd_allreduce: 235.47 | step: 2.58
 89%|████████▉ | 3134/3507 [1:17:39<06:08, 1.01it/s] {'loss': 0.679, 'learning_rate': 5.877240549204355e-07, 'epoch': 0.89}
[2025-11-06 19:02:28,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 433.68 | bwd: 1979.93 | bwd_inner: 2.09 | bwd_allreduce: 1977.71 | step: 2.31
 89%|████████▉ | 3135/3507 [1:17:42<08:51, 1.43s/it] {'loss': 0.433, 'learning_rate': 5.846079732081455e-07, 'epoch': 0.89}
[2025-11-06 19:02:29,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 372.20 | bwd: 2.87 | bwd_inner: 1.82 | bwd_allreduce: 0.91 | step: 1.86
 89%|████████▉ | 3136/3507 [1:17:43<07:40, 1.24s/it] {'loss': 0.4159, 'learning_rate': 5.814999253669307e-07, 'epoch': 0.89}
[2025-11-06 19:02:31,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.77 | bwd: 1633.41 | bwd_inner: 1.87 | bwd_allreduce: 1631.41 | step: 2.70
 89%|████████▉ | 3137/3507 [1:17:45<09:07, 1.48s/it] {'loss': 0.7405, 'learning_rate': 5.783999140487939e-07, 'epoch': 0.89}
[2025-11-06 19:02:32,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 510.33 | bwd: 567.14 | bwd_inner: 1.81 | bwd_allreduce: 565.20 | step: 2.19
 89%|████████▉ | 3138/3507 [1:17:46<08:55, 1.45s/it] {'loss': 0.1599, 'learning_rate': 5.753079418988817e-07, 'epoch': 0.89}
[2025-11-06 19:02:34,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.97 | bwd: 1503.40 | bwd_inner: 2.14 | bwd_allreduce: 1501.13 | step: 2.33
 90%|████████▉ | 3139/3507 [1:17:48<09:37, 1.57s/it] {'loss': 0.4637, 'learning_rate': 5.7222401155548e-07, 'epoch': 0.9}
[2025-11-06 19:02:34,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.06 | bwd: 2.37 | bwd_inner: 1.46 | bwd_allreduce: 0.77 | step: 2.09
 90%|████████▉ | 3140/3507 [1:17:48<07:24, 1.21s/it] {'loss': 1.0772, 'learning_rate': 5.691481256500164e-07, 'epoch': 0.9}
[2025-11-06 19:02:38,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 433.89 | bwd: 2805.97 | bwd_inner: 1.65 | bwd_allreduce: 2804.17 | step: 2.96
 90%|████████▉ | 3141/3507 [1:17:52<11:13, 1.84s/it] {'loss': 0.4674, 'learning_rate': 5.660802868070525e-07, 'epoch': 0.9}
[2025-11-06 19:02:38,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 305.94 | bwd: 101.60 | bwd_inner: 1.75 | bwd_allreduce: 99.63 | step: 2.32
 90%|████████▉ | 3142/3507 [1:17:52<08:40, 1.43s/it] {'loss': 0.1126, 'learning_rate': 5.630204976442787e-07, 'epoch': 0.9}
[2025-11-06 19:02:39,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 535.52 | bwd: 2.74 | bwd_inner: 1.65 | bwd_allreduce: 0.92 | step: 2.19
 90%|████████▉ | 3143/3507 [1:17:53<07:07, 1.18s/it] {'loss': 0.4014, 'learning_rate': 5.599687607725235e-07, 'epoch': 0.9}
[2025-11-06 19:02:44,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 559.53 | bwd: 2077.85 | bwd_inner: 11.83 | bwd_allreduce: 2065.85 | step: 2.59
 90%|████████▉ | 3144/3507 [1:17:57<13:38, 2.26s/it] {'loss': 1.0177, 'learning_rate': 5.569250787957425e-07, 'epoch': 0.9}
[2025-11-06 19:02:44,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 485.73 | bwd: 33.56 | bwd_inner: 1.74 | bwd_allreduce: 31.62 | step: 3.70
 90%|████████▉ | 3145/3507 [1:17:58<10:33, 1.75s/it] {'loss': 0.5509, 'learning_rate': 5.538894543110185e-07, 'epoch': 0.9}
device='cuda:1') tensor([[-6.0312, -2.4375, 2.0156, -1.0938, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:02:47,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.28 | optimizer_step: 0.48 [2025-11-06 19:02:47,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.01 | bwd_microstep: 2855.45 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 2854.56 | step_microstep: 3.85 [2025-11-06 19:02:47,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.21 | bwd: 2856.15 | bwd_inner: 1.34 | bwd_allreduce: 2854.65 | step: 3.93 90%|████████▉ | 3146/3507 [1:18:01<13:09, 2.19s/it] {'loss': 0.3254, 'learning_rate': 5.508618899085583e-07, 'epoch': 0.9} 90%|████████▉ | 3146/3507 [1:18:01<13:09, 2.19s/it]tensor([[-11.1875, -10.7500, -6.4062, -2.5469, -7.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:02:48,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.74 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.0000, -3.1094, 1.2969, 1.9844, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6875, -4.2500, 1.7344, 2.0156, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9375, -2.9844, 1.0703, 1.4375, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6562, 0.1738, 4.2188, -2.0312, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6562, -4.8750, -0.9180, 2.0312, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7188, -4.8438, -1.2031, 2.7656, -1.8984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-4.2812, -2.6875, 0.8359, 1.1562, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:02:53,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.28 | optimizer_gradients: 0.21 | optimizer_step: 0.31 [2025-11-06 19:02:53,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.39 | bwd_microstep: 5385.58 | bwd_inner_microstep: 1.66 | bwd_allreduce_microstep: 5383.80 | step_microstep: 4.42 [2025-11-06 19:02:53,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 305.12 | bwd: 5386.60 | bwd_inner: 2.58 | bwd_allreduce: 5383.85 | step: 4.52 90%|████████▉ | 3147/3507 [1:18:07<19:30, 3.25s/it] {'loss': 0.3184, 'learning_rate': 5.478423881716899e-07, 'epoch': 0.9} 90%|████████▉ | 3147/3507 [1:18:07<19:30, 3.25s/it]tensor([[-0.3691, 3.4375, 5.7812, 1.2109, -1.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5156, -4.2812, -1.7109, 3.2188, -0.6445]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -1.8359, 1.9609, -1.4609, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:02:53,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.38 | bwd_microstep: 1.73 | bwd_inner_microstep: 1.56 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.16 tensor([[-3.6562, -3.1094, -0.1807, 2.1250, -1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.0547, 2.5469, 4.5312, 2.6719, -0.2539]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5625, -3.7969, -0.4277, 1.7969, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.4062, -4.5625, 1.7266, 1.2031, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -5.5312, 
-3.3750, 1.5703, -1.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:02:54,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.75 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 19:02:54,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.05 | bwd_microstep: 1.97 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.87 | step_microstep: 4.14 [2025-11-06 19:02:54,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 444.45 | bwd: 3.69 | bwd_inner: 2.60 | bwd_allreduce: 0.92 | step: 4.30 90%|████████▉ | 3148/3507 [1:18:07<14:30, 2.43s/it] {'loss': 0.1536, 'learning_rate': 5.448309516768657e-07, 'epoch': 0.9} 90%|████████▉ | 3148/3507 [1:18:07<14:30, 2.43s/it]tensor([[-7.0312, -6.4062, -1.8047, 1.3984, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.0938, -4.9062, 1.2812, 2.2188, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.5938, -5.3125, 0.7422, 1.1484, -5.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7188, -2.3438, 2.9531, 0.8594, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:02:54,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.91 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.2031, -3.2969, -2.2344, 2.1250, 0.1992]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.7188, -5.8125, -1.1250, 1.7656, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.7891, -1.4609, -1.2969, 1.6016, 0.8711]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3281, -2.4688, 0.5469, 2.4219, -1.6016]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:02:54,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.88 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:02:54,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.40 | bwd_microstep: 1.90 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.74 [2025-11-06 19:02:54,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 386.32 | bwd: 2.81 | bwd_inner: 1.83 | bwd_allreduce: 0.84 | step: 2.82 90%|████████▉ | 3149/3507 [1:18:08<10:54, 1.83s/it] {'loss': 0.6397, 'learning_rate': 5.418275829936537e-07, 'epoch': 0.9} 90%|████████▉ | 3149/3507 [1:18:08<10:54, 1.83s/it]tensor([[-3.2812, -4.0000, -1.7578, 2.9219, -0.4883]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2812, -2.5000, 1.9766, 0.7109, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -2.4844, 1.0156, -0.0430, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:02:54,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.19 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.5000, -3.1406, 0.7461, 2.2031, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4375, -3.8438, 0.4805, 1.3984, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.9375, -5.1562, 0.6797, 2.1719, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-4.0625, -4.4375, -1.5312, 2.7656, -1.2891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([3], device='cuda:2') tensor([[-7.8125, -6.0625, -0.1289, 1.4844, -5.0312]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:02:55,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.15 | optimizer_step: 0.21 [2025-11-06 19:02:55,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.16 | bwd_microstep: 60.44 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 59.31 | step_microstep: 2.62 [2025-11-06 19:02:55,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 422.37 | bwd: 61.28 | bwd_inner: 1.79 | bwd_allreduce: 59.35 | step: 2.71 90%|████████▉ | 3150/3507 [1:18:08<08:33, 1.44s/it] {'loss': 0.5167, 'learning_rate': 5.388322846847371e-07, 'epoch': 0.9} 90%|████████▉ | 3150/3507 [1:18:08<08:33, 1.44s/it]tensor([[-7.8125, -6.1875, -0.2637, 1.8438, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.9531, -4.6562, -2.3125, 2.1875, -1.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:02:55,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.22 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.9375, -3.9844, 0.9258, 1.5859, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3438, -1.8359, 2.7969, -0.1787, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.2500, -4.4062, 1.7578, 1.2656, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7188, -2.5625, 2.5781, 0.5547, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1250, -2.4844, 0.5391, -2.9062, -5.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.7578, -2.9375, -2.4688, 1.6562, 0.4160]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([4], device='cuda:2') [2025-11-06 19:02:57,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.39 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 19:02:57,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 91.19 | bwd_microstep: 1810.52 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 1809.43 | step_microstep: 3.37 [2025-11-06 19:02:57,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 264.43 | bwd: 1811.39 | bwd_inner: 1.78 | bwd_allreduce: 1809.48 | step: 3.46 90%|████████▉ | 3151/3507 [1:18:10<09:43, 1.64s/it] {'loss': 0.3594, 'learning_rate': 5.358450593059128e-07, 'epoch': 0.9} 90%|████████▉ | 3151/3507 [1:18:10<09:43, 1.64s/it]tensor([[-4.7188, -5.3750, -2.5469, 2.1562, -1.6797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1250, -4.1562, -1.3906, 2.3750, -1.5234]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3125, -4.1562, -2.1250, 2.4688, -0.5703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:02:57,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.81 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.9688, -5.3750, 0.5898, 2.6094, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6875, -1.9062, 1.7266, 0.2617, -3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.8125, -3.1094, 1.0938, 0.1235, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.3438, -6.1875, -1.7188, 2.6406, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3906, -3.8906, -1.5312, 2.4688, -0.8945]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:2') [2025-11-06 19:02:57,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 19:02:57,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.33 | bwd_microstep: 135.85 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 134.79 | step_microstep: 1.95 [2025-11-06 19:02:57,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 316.17 | bwd: 136.70 | bwd_inner: 1.74 | bwd_allreduce: 134.83 | step: 2.03 90%|████████▉ | 3152/3507 [1:18:11<07:39, 1.30s/it] {'loss': 0.8182, 'learning_rate': 5.32865909406095e-07, 'epoch': 0.9} 90%|████████▉ | 3152/3507 [1:18:11<07:39, 1.30s/it]tensor([[1.3359, 3.6406, 4.8438, 2.7031, 0.7266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4688, -3.9531, -1.4297, 0.8633, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2344, -0.8438, 2.1094, 0.6445, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0938, 0.1553, 3.4375, -1.8594, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:02:57,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.27 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.5625, -6.1875, -3.3594, 1.2969, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5938, -1.3438, 2.5938, -2.0156, -5.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.1250, -3.6406, 1.8203, 1.8125, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0938, -4.2812, -0.5156, -0.2344, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 
19:03:00,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 19:03:00,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.32 | bwd_microstep: 2125.30 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 2124.20 | step_microstep: 1.91 [2025-11-06 19:03:00,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 467.63 | bwd: 2126.22 | bwd_inner: 1.84 | bwd_allreduce: 2124.25 | step: 2.00 90%|████████▉ | 3153/3507 [1:18:14<10:01, 1.70s/it] {'loss': 0.3073, 'learning_rate': 5.298948375272984e-07, 'epoch': 0.9} 90%|████████▉ | 3153/3507 [1:18:14<10:01, 1.70s/it]tensor([[-6.1875, -4.4375, 0.2793, 1.2734, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.9766, 1.2188, 1.3906, 0.1147, -0.9102]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 19:03:00,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.65 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.4062, -0.8633, 2.0625, -1.0312, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5938, -4.3125, -0.4180, 3.1875, -1.9141]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2031, 1.0000, 3.3906, -1.7031, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4375, -2.0312, 2.0156, -0.7773, -4.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.3438, -3.9219, 0.7227, -2.1250, -6.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0938, -0.8477, 2.2188, -0.2324, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:03:00,641] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.66 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:03:00,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.32 | bwd_microstep: 47.15 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 45.94 | step_microstep: 2.10 [2025-11-06 19:03:00,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 306.99 | bwd: 48.05 | bwd_inner: 1.94 | bwd_allreduce: 45.97 | step: 2.17 90%|████████▉ | 3154/3507 [1:18:14<07:40, 1.31s/it] {'loss': 0.2284, 'learning_rate': 5.269318462046502e-07, 'epoch': 0.9} 90%|████████▉ | 3154/3507 [1:18:14<07:40, 1.31s/it]tensor([[-3.9375, -3.7344, -0.5781, 2.5938, -1.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.4375, -4.4375, 0.9375, 1.7969, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.4375, -3.4844, 2.1094, -1.3125, -6.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:03:01,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.48 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-9.1250, -7.9375, -2.8125, 0.0703, -5.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2656, -3.8125, -0.4062, 4.7188, -0.3828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.5312, -4.5312, 0.9414, 1.7422, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8594, 0.0309, 3.0938, -1.1016, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -4.6875, -0.9727, 2.7188, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:03:02,528] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.20 | optimizer_step: 0.29 [2025-11-06 19:03:02,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.16 | bwd_microstep: 1290.74 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1289.43 | step_microstep: 2.37 [2025-11-06 19:03:02,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 452.67 | bwd: 1291.65 | bwd_inner: 2.02 | bwd_allreduce: 1289.49 | step: 2.46 90%|████████▉ | 3155/3507 [1:18:16<08:41, 1.48s/it] {'loss': 0.1235, 'learning_rate': 5.239769379663818e-07, 'epoch': 0.9} 90%|████████▉ | 3155/3507 [1:18:16<08:41, 1.48s/it]tensor([[-7.1562, -4.8750, -0.0094, -0.0114, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5938, -4.5312, 0.1157, 2.3438, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.0938, -4.2812, 1.6172, 0.6914, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:03:03,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.71 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.3125, -1.9219, 0.9961, 1.8203, -1.8984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4375, -4.0938, 0.5820, 2.6250, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7031, -0.4043, 1.3047, -2.2969, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.9062, -1.7031, 2.9219, -1.3594, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7188, -3.4531, 0.7812, 0.5273, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:03:03,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.30 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 19:03:03,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.37 | bwd_microstep: 441.42 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 440.36 | step_microstep: 1.67 [2025-11-06 19:03:03,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 434.04 | bwd: 442.35 | bwd_inner: 1.82 | bwd_allreduce: 440.40 | step: 1.76 90%|████████▉ | 3156/3507 [1:18:17<08:23, 1.43s/it] {'loss': 0.9044, 'learning_rate': 5.210301153338293e-07, 'epoch': 0.9} 90%|████████▉ | 3156/3507 [1:18:17<08:23, 1.43s/it]tensor([[-3.6094, -1.0000, 3.1406, 1.9297, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:03:04,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.30 | bwd_microstep: 1.23 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.3750, -5.1250, -2.5781, 2.1406, -1.4609]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-9.5625, -9.0625, -4.3125, -0.2539, -5.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8125, -5.0938, -0.8125, 2.1562, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0625, -1.0781, 3.2812, -0.3984, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.5625, -4.8750, -0.2295, 1.0156, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5000, -4.2812, 0.2793, 2.6406, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.6562, -3.3438, 2.3906, 0.6094, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:03:05,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | 
optimizer_gradients: 0.17 | optimizer_step: 0.21 [2025-11-06 19:03:05,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.83 | bwd_microstep: 1692.03 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1690.81 | step_microstep: 2.17 [2025-11-06 19:03:05,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.16 | bwd: 1693.26 | bwd_inner: 2.28 | bwd_allreduce: 1690.85 | step: 2.25 90%|█████████ | 3157/3507 [1:18:19<09:26, 1.62s/it] {'loss': 0.1177, 'learning_rate': 5.180913808214283e-07, 'epoch': 0.9} 90%|█████████ | 3157/3507 [1:18:19<09:26, 1.62s/it]tensor([[-6.8438, -5.5312, -0.2578, 2.3906, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8125, 0.4102, 2.4531, -2.3750, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:03:06,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.78 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.7656, -3.0312, 0.6836, 3.1250, -1.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1250, -3.0469, 0.3574, 2.5000, -2.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5938, -5.3750, -2.7344, 2.4844, -1.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5000, -3.2969, 1.0547, 3.1406, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -3.8438, -0.2891, 2.1094, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.3125, -5.0312, -0.0635, 2.2812, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:03:06,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.16 
| optimizer_step: 0.15 [2025-11-06 19:03:06,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.44 | bwd_microstep: 742.29 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 741.42 | step_microstep: 1.50 [2025-11-06 19:03:06,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.24 | bwd: 743.21 | bwd_inner: 1.61 | bwd_allreduce: 741.46 | step: 1.57 90%|█████████ | 3158/3507 [1:18:20<08:27, 1.45s/it] {'loss': 0.0957, 'learning_rate': 5.151607369367095e-07, 'epoch': 0.9} 90%|█████████ | 3158/3507 [1:18:20<08:27, 1.45s/it]tensor([[-3.2031, -2.0469, 1.8203, 3.9531, -1.3203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3125, -3.9062, -0.5430, 2.3281, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9297, -3.1094, -2.6562, 1.5938, 0.4492]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5000, -3.2656, 1.7656, 4.2188, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:03:07,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.36 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.2500, -4.9062, 0.4219, 2.6875, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2188, -3.7812, -0.1680, 3.0938, -1.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4688, -3.0312, 2.6719, 0.3633, -5.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.5938, -4.5625, 0.0133, 2.9531, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:03:08,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.18 | optimizer_step: 0.18 
[2025-11-06 19:03:08,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.38 | bwd_microstep: 1003.79 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 1002.78 | step_microstep: 2.20
[2025-11-06 19:03:08,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 515.77 | bwd: 1004.66 | bwd_inner: 1.72 | bwd_allreduce: 1002.82 | step: 2.28
90%|█████████ | 3159/3507 [1:18:22<08:37, 1.49s/it] {'loss': 0.1114, 'learning_rate': 5.122381861803039e-07, 'epoch': 0.9}
tensor([[-3.7969, -2.8750, 0.0200, 1.8594, -1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:03:08,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.97 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-2.5781, -3.0625, -0.6133, 3.6562, -0.1196]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6875, -4.5000, -0.1982, 1.8750, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0938, -2.1562, 1.5000, 1.4766, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9688, -3.8438, 0.2559, 2.0000, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0312, -4.5000, -0.5039, 0.7422, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.5781, -3.5625, 0.1138, 4.3438, -0.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0312, -2.5781, 1.4766, 0.8242, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:03:09,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 19:03:09,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.89 | bwd_microstep: 1100.30 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1099.09 | step_microstep: 1.65
[2025-11-06 19:03:09,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 248.87 | bwd: 1101.28 | bwd_inner: 1.97 | bwd_allreduce: 1099.14 | step: 1.75
90%|█████████ | 3160/3507 [1:18:23<08:25, 1.46s/it] {'loss': 0.4629, 'learning_rate': 5.093237310459387e-07, 'epoch': 0.9}
tensor([[-4.0625, -4.1250, -0.4297, 3.8281, -1.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.0625, -5.7812, -1.3203, 0.5234, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.9062, -4.4375, 0.1748, 1.8047, -3.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3438, -2.3750, 2.2969, 0.9062, -4.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5938, -2.0469, 1.1250, -0.3398, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:03:10,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.44 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-1.9766, 2.5469, 5.6250, -0.4238, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.2812, -4.2500, 1.6641, 0.5938, -5.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.4375, -1.5938, 3.5469, -0.1377, -5.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 19:03:10,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 19:03:10,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.08 | bwd_microstep: 70.28 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 69.19 | step_microstep: 1.72
[2025-11-06 19:03:10,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 679.55 | bwd: 71.39 | bwd_inner: 2.00 | bwd_allreduce: 69.24 | step: 1.81
90%|█████████ | 3161/3507 [1:18:24<07:16, 1.26s/it] {'loss': 0.6846, 'learning_rate': 5.064173740204292e-07, 'epoch': 0.9}
tensor([[-4.9375, -3.9375, 0.0742, 2.3438, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.3438, -3.2188, 0.4902, -1.5625, -5.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:03:10,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 148.65 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.3438, -4.3125, 0.2441, 0.4746, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4688, -2.7656, 1.2734, 0.1021, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.6875, -5.4375, 0.0381, 2.5312, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.9375, -1.9453, -0.5469, -5.1250, -5.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.2188, -5.2188, -0.2812, 2.4531, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4531, -2.8750, 0.1367, 2.6719, -1.4453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:03:13,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.19 | optimizer_step: 0.33
[2025-11-06 19:03:13,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.18 | bwd_microstep: 2449.37 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 2448.18 | step_microstep: 2.42
[2025-11-06 19:03:13,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.86 | bwd: 2450.16 | bwd_inner: 1.75 | bwd_allreduce: 2448.23 | step: 2.52
90%|█████████ | 3162/3507 [1:18:27<09:57, 1.73s/it] {'loss': 0.2305, 'learning_rate': 5.035191175836829e-07, 'epoch': 0.9}
tensor([[-5.0000, -4.4062, -0.0679, 3.2188, -2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:03:13,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 101.75 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.6250, -3.5312, 0.5000, 2.4531, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0625, -2.3906, 1.7500, 0.5547, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.8438, -4.5000, -0.2070, 3.5781, -2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.3750, -0.0679, 3.9844, -0.9531, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.8750, -3.7656, 0.1689, 2.0469, -2.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.5938, -2.2500, 1.9688, -2.7344, -6.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.1875, -4.0312, 1.1562, 1.4062, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 19:03:14,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 19:03:14,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.02 | bwd_microstep: 401.27 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 400.28 | step_microstep: 1.85
[2025-11-06 19:03:14,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 282.79 | bwd: 402.19 | bwd_inner: 1.73 | bwd_allreduce: 400.33 | step: 1.93
90%|█████████ | 3163/3507 [1:18:28<08:11, 1.43s/it] {'loss': 0.4344, 'learning_rate': 5.006289642086948e-07, 'epoch': 0.9}
tensor([[-6.5312, -5.3125, -0.3262, 1.9688, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.0000, -2.3594, 0.8281, 1.1250, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.5625, -5.1250, 1.5156, 2.3594, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:03:14,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.08 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-2.4688, -2.7344, -0.3281, 3.4219, -0.2578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.6719, -0.5898, 3.3281, 1.2344, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.7188, -3.2188, 2.0156, 1.8438, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6562, 0.3555, 3.3438, -1.3594, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.7188, -4.7812, 0.4395, 1.4375, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:03:16,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.73 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 19:03:16,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.31 | bwd_microstep: 1918.46 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1917.19 | step_microstep: 2.62
[2025-11-06 19:03:16,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 435.42 | bwd: 1919.32 | bwd_inner: 1.96 | bwd_allreduce: 1917.24 | step: 2.70
90%|█████████ | 3164/3507 [1:18:30<09:49, 1.72s/it] {'loss': 0.7026, 'learning_rate': 4.977469163615456e-07, 'epoch': 0.9}
tensor([[-3.6250, -2.4062, 0.9648, 2.3438, -1.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6250, -3.3594, 0.5547, 1.8594, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:03:16,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.58 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-3.6719, 0.2197, 3.7656, -0.2656, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5625, -1.1172, 2.2344, -0.7539, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.3906, 0.5000, 3.1250, -1.1016, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.8438, -3.7188, 0.8867, 3.1719, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.8125, -3.0781, 0.6523, 1.2422, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6250, 0.8867, 3.8594, -1.8203, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 19:03:17,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.21 | optimizer_step: 0.19
[2025-11-06 19:03:17,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.96 | bwd_microstep: 121.58 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 120.30 | step_microstep: 2.14
[2025-11-06 19:03:17,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 285.57 | bwd: 122.66 | bwd_inner: 2.17 | bwd_allreduce: 120.35 | step: 2.23
90%|█████████ | 3165/3507 [1:18:30<07:36, 1.34s/it] {'loss': 0.5258, 'learning_rate': 4.948729765014004e-07, 'epoch': 0.9}
tensor([[-3.0625, 0.2461, 2.8281, -0.2354, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1875, -0.1924, 3.0312, -1.3906, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.5625, 2.3906, 3.1250, -2.2656, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:03:17,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.11 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-3.8125, -3.1250, 0.5469, 3.3125, -1.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7344, -1.7031, 2.1250, 2.4062, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.5938, -2.8438, 2.2969, -1.0156, -5.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-8.6250, -6.1875, -0.3242, -0.0432, -6.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.7812, -2.2344, 1.8516, 0.7891, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:03:20,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.27 | optimizer_step: 0.27
[2025-11-06 19:03:20,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.83 | bwd_microstep: 2799.33 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 2798.25 | step_microstep: 3.25
[2025-11-06 19:03:20,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.95 | bwd: 2800.13 | bwd_inner: 1.64 | bwd_allreduce: 2798.32 | step: 3.34
90%|█████████ | 3166/3507 [1:18:34<10:48, 1.90s/it] {'loss': 0.3031, 'learning_rate': 4.920071470805055e-07, 'epoch': 0.9}
[h264 @ 0xdcef840] mmco: unref short failure
[h264 @ 0xdcef840] mmco: unref short failure
tensor([[-3.6875, -0.1279, 2.8594, -0.4785, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:03:20,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.78 | bwd_microstep: 1.23 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-0.3477, 0.5703, 2.7031, 3.9531, 0.6016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.8438, -3.5156, 1.5469, 1.5234, -4.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.5938, 0.6953, 3.2812, -2.2188, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.0000, -5.5625, 0.1128, 2.2656, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.4688, -3.9688, -0.5352, 0.2266, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0000, -3.5781, -0.3398, 2.3125, -1.8359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.9688, -3.5938, 0.4512, 1.8047, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:03:20,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.18 | optimizer_step: 0.22
[2025-11-06 19:03:20,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.44 | bwd_microstep: 127.63 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 126.59 | step_microstep: 1.92
[2025-11-06 19:03:20,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.22 | bwd: 128.85 | bwd_inner: 2.04 | bwd_allreduce: 126.65 | step: 2.04
90%|█████████ | 3167/3507 [1:18:34<08:23, 1.48s/it] {'loss': 0.2518, 'learning_rate': 4.891494305441869e-07, 'epoch': 0.9}
tensor([[-5.9688, -5.4375, -1.1484, 2.2344, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9531, -3.4844, -0.2949, 2.2031, -1.8047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:03:21,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.85 | bwd_microstep: 1.05 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-7.5312, -4.0938, 1.7812, -0.2197, -6.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.7188, -3.4844, 1.9453, 0.1216, -5.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.8359, -0.8242, 2.2344, 3.4688, -0.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.4688, -1.5859, 3.2344, -0.5469, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.5312, -2.6875, 2.7188, -0.4414, -5.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.1406, -2.5312, -0.0181, 2.1250, -1.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:03:23,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.20 | optimizer_step: 0.24
[2025-11-06 19:03:23,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.96 | bwd_microstep: 1634.33 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1633.18 | step_microstep: 2.22
[2025-11-06 19:03:23,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 454.83 | bwd: 1635.37 | bwd_inner: 1.98 | bwd_allreduce: 1633.24 | step: 2.31
90%|█████████ | 3168/3507 [1:18:36<09:31, 1.69s/it] {'loss': 0.2708, 'learning_rate': 4.862998293308485e-07, 'epoch': 0.9}
tensor([[-5.4062, -0.8633, 3.9844, -1.2031, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.0938, -2.0156, 0.2676, -0.5234, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:03:23,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.29 | bwd_microstep: 1.21 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12
tensor([[-6.5000, -3.0781, 2.4688, 0.3594, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.3438, -4.4375, 0.6875, 1.5469, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3750, -3.6250, -0.1021, 2.3125, -2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.4688, -3.9062, 1.1875, 0.6328, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.1875, -4.4062, -1.2891, 2.7188, -1.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.8281, -3.0156, 0.8086, 3.4062, -1.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 19:03:23,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 19:03:23,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.25 | bwd_microstep: 160.44 | bwd_inner_microstep: 1.48 | bwd_allreduce_microstep: 158.85 | step_microstep: 1.57
[2025-11-06 19:03:23,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 321.57 | bwd: 161.64 | bwd_inner: 2.55 | bwd_allreduce: 158.90 | step: 1.69
90%|█████████ | 3169/3507 [1:18:37<07:31, 1.34s/it] {'loss': 0.5062, 'learning_rate': 4.834583458719721e-07, 'epoch': 0.9}
tensor([[-2.7188, 2.0000, 5.0938, -1.1094, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.0469, -2.5625, 1.2344, 4.3750, -0.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.1250, -2.1562, 2.4688, 1.0703, -3.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:03:23,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.67 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-7.0625, -5.0938, 0.7383, 1.7266, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3125, -2.7812, 0.5000, 1.5938, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.7500, -0.9609, 3.1094, -0.4062, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.5312, -0.2471, 2.8438, 0.2344, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.1875, -5.0625, -0.1787, 2.2188, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:03:25,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.21 | optimizer_step: 0.18
[2025-11-06 19:03:25,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.51 | bwd_microstep: 1428.20 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 1427.22 | step_microstep: 2.10
[2025-11-06 19:03:25,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.20 | bwd: 1429.17 | bwd_inner: 1.76 | bwd_allreduce: 1427.27 | step: 2.18
90%|█████████ | 3170/3507 [1:18:39<08:24, 1.50s/it] {'loss': 0.4113, 'learning_rate': 4.806249825921061e-07, 'epoch': 0.9}
tensor([[-4.8750, -4.4375, -0.1787, 3.1094, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[4.2188, 6.5312, 5.1562, 2.8125, 2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0')
[2025-11-06 19:03:25,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.73 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.1250, -1.8672, 1.9922, -0.6875, -4.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.1250, -2.5312, 1.4297, -1.4453, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-2.3906, -1.6641, 1.2578, 3.5312, -0.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.5234, 1.2188, 2.9688, 0.7695, -1.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-3.3594, 0.8086, 3.3125, -1.5938, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.5938, -2.6406, 1.1250, 3.4062, -1.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:03:25,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 19:03:25,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.74 | bwd_microstep: 117.12 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 116.17 | step_microstep: 1.48
[2025-11-06 19:03:25,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 312.48 | bwd: 118.03 | bwd_inner: 1.64 | bwd_allreduce: 116.22 | step: 1.58
90%|█████████ | 3171/3507 [1:18:39<06:39, 1.19s/it] {'loss': 0.3584, 'learning_rate': 4.777997419088731e-07, 'epoch': 0.9}
tensor([[-7.8750, -4.5000, 1.1016, -0.6133, -6.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.9375, -5.0312, -0.6289, 4.0000, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1250e+00, -4.0625e+00, -1.9379e-03, 2.3281e+00, -2.7969e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:03:26,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.97 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.2812, -3.8125, -0.5469, 2.5000, -1.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.2188, -5.2812, -0.2793, 2.7344, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.2344, -2.4688, 0.6719, 4.6875, 0.0688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5938, -4.7812, 0.0312, 3.1094, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.0625, -0.2471, 1.7812, -2.5469, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:03:27,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 19:03:27,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.26 | bwd_microstep: 754.45 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 753.32 | step_microstep: 2.08
[2025-11-06 19:03:27,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.26 | bwd: 755.23 | bwd_inner: 1.75 | bwd_allreduce: 753.35 | step: 2.16
90%|█████████ | 3172/3507 [1:18:40<06:35, 1.18s/it] {'loss': 0.0763, 'learning_rate': 4.749826262329715e-07, 'epoch': 0.9}
tensor([[-10.3125, -8.1250, -1.6484, -0.4863, -7.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1250, -2.1562, 1.4375, 0.9648, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6094, -0.0752, 2.0938, -1.6719, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:03:27,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.08 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.6875, -2.6875, 1.7500, 0.2402, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.2500, -3.7812, 0.6562, 1.7344, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1562, -3.3906, 0.0603, 2.3594, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.4375, -1.9688, 1.0391, -0.5664, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7500, -5.1250, -1.6953, 2.7031, -1.7734]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 19:03:28,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 19:03:28,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.17 | bwd_microstep: 506.28 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 505.08 | step_microstep: 1.81
[2025-11-06 19:03:28,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 312.27 | bwd: 507.16 | bwd_inner: 1.90 | bwd_allreduce: 505.13 | step: 1.89
90%|█████████ | 3173/3507 [1:18:42<06:54, 1.24s/it] {'loss': 0.9065, 'learning_rate': 4.721736379681574e-07, 'epoch': 0.9}
tensor([[-4.3438, -4.0938, -1.3750, 1.6641, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.4688, -0.6836, 2.8125, 1.5234, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2500, -3.1562, 1.3359, 1.5547, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:03:28,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.65 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-1.1641, 2.4531, 4.2188, -0.2598, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.5625, -5.2500, -1.4141, 2.1250, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.5312, -4.7188, -0.5703, 2.2031, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6562, -1.9453, 1.9375, 0.6719, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.3438, -2.7656, 1.6328, 1.1562, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:03:30,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.20 | optimizer_step: 0.22
[2025-11-06 19:03:30,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.71 | bwd_microstep: 1279.96 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 1279.07 | step_microstep: 2.35
[2025-11-06 19:03:30,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.37 | bwd: 1280.79 | bwd_inner: 1.50 | bwd_allreduce: 1279.13 | step: 2.44
91%|█████████ | 3174/3507 [1:18:43<07:42, 1.39s/it] {'loss': 0.4688, 'learning_rate': 4.69372779511259e-07, 'epoch': 0.91}
tensor([[-3.8281, -2.4375, 0.7578, 1.4531, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:03:30,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 108.14 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.0000, -4.0312, -0.5938, 3.4844, -1.3359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.6250, -3.1875, 1.2266, 0.5898, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6562, -4.4062, -0.9805, 2.2344, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.6562, 2.6094, 2.3750, -1.9844, -1.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:2')
tensor([[-5.5312, -5.5312, -1.6172, 2.3281, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.0625, -4.9062, 1.2188, 2.1094, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.2500, -3.9219, 0.3945, 0.0223, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 19:03:32,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.21 | optimizer_step: 0.31
[2025-11-06 19:03:32,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 206.25 | bwd_microstep: 522.97 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 521.57 | step_microstep: 2.32
[2025-11-06 19:03:32,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.44 | bwd: 523.72 | bwd_inner: 1.91 | bwd_allreduce: 521.62 | step: 2.41
91%|█████████ | 3175/3507 [1:18:46<09:45, 1.76s/it] {'loss': 1.2444, 'learning_rate': 4.6658005325216136e-07, 'epoch': 0.91}
tensor([[-2.6719, 1.7344, 4.2500, -1.5234, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6406, -4.2812, -1.6094, 3.2188, -0.7695]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.3906, 0.6055, 3.5312, -0.4492, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:03:33,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.43 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.7188, -5.2500, -0.7539, 2.5469, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.9531, -0.0413, 2.8438, -1.7812, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5625, -0.7383, 3.5625, -2.3906, -5.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.1250, -4.0625, -0.7930, 3.0938, -1.4766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.9375, -3.6719, 1.0859, 2.9531, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:03:34,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 19:03:34,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.46 | bwd_microstep: 946.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 946.00 | step_microstep: 7.92
[2025-11-06 19:03:34,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.92 | bwd: 947.54 | bwd_inner: 1.33 | bwd_allreduce: 946.05 | step: 8.00
91%|█████████ | 3176/3507 [1:18:48<09:07, 1.66s/it] {'loss': 0.0605, 'learning_rate': 4.6379546157381496e-07, 'epoch': 0.91}
tensor([[-3.5938, 0.8438, 3.8125, -1.5703, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9062, -0.3379, 2.4062, -0.9883, -3.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-0.7578, 3.5781, 4.8125, -1.3438, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5000, -3.0938, 0.6914, 2.2188, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3750, -3.5625, 1.0547, 1.9922, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1250, -1.2656, 1.7734, -0.5273, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.8438, -5.7188, 0.7578, 2.1406, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:03:36,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.74 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.0938, -4.2500, -0.9844, 3.0156, -1.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:03:36,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.18 | optimizer_step: 0.17
[2025-11-06 19:03:36,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 281.13 | bwd_microstep: 1.90 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.53
[2025-11-06 19:03:36,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 519.86 | bwd: 2.65 | bwd_inner: 1.55 | bwd_allreduce: 0.96 | step: 7.62
91%|█████████ | 3177/3507 [1:18:50<10:39, 1.94s/it] {'loss': 0.2868, 'learning_rate': 4.610190068522302e-07, 'epoch': 0.91}
tensor([[-4.1250, -4.3438, -1.3750, 2.5781, -1.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.6406, -4.1250, -2.0156, 1.8594, -1.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1')
tensor([[-5.6875, -3.8125, 1.3203, 2.3906, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5938, -1.3750, 2.1250, -0.4141, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:03:36,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.83 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-6.0625, -5.1250, -0.8789, 1.6719, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.8750, -4.2812, 0.1689, 3.3594, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.6875, -4.7812, -0.8281, 3.6562, -1.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.5938, -2.5625, -0.4121, 2.6719, -0.5742]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:03:37,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.17 | optimizer_step: 0.17
[2025-11-06 19:03:37,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.17 | bwd_microstep: 107.06 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 105.94 | step_microstep: 2.38
[2025-11-06 19:03:37,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.03 | bwd: 107.78 | bwd_inner: 1.65 | bwd_allreduce: 105.98 | step: 2.47
91%|█████████ | 3178/3507 [1:18:51<08:14, 1.50s/it] {'loss': 0.8566, 'learning_rate': 4.5825069145646996e-07, 'epoch': 0.91}
tensor([[-2.8594, 0.0977, 1.3516, -1.4375, -2.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.5000, -2.5312, 1.5703, 1.8906, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:03:37,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 221.21 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.9688, 0.5000, 4.0938, -1.3438, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.9531, -4.6250, -1.7734, 3.0156, -1.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.4375, -1.2891, 1.4609, 0.7578, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.1250, -3.2656, -0.0405, 1.7266, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.1250, 0.5352, 3.3438, -2.5625, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.4844, -2.4688, -0.8516, 3.8594, 0.9414]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:03:39,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.16 | optimizer_step: 0.15
[2025-11-06 19:03:39,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 76.88 | bwd_microstep: 265.91 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 264.90 | step_microstep: 2.00
[2025-11-06 19:03:39,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 298.11 | bwd: 266.85 | bwd_inner: 1.74 | bwd_allreduce: 264.95 |
step: 2.09 91%|█████████ | 3179/3507 [1:18:53<08:58, 1.64s/it] {'loss': 0.2537, 'learning_rate': 4.554905177486557e-07, 'epoch': 0.91} 91%|█████████ | 3179/3507 [1:18:53<08:58, 1.64s/it]tensor([[-2.9219, 0.4375, 1.4688, -2.3438, -3.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2656, -4.1875, -2.8750, 1.4453, -0.6367]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0312, -2.9531, 0.5859, 2.3281, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:03:39,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.90 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.4062, -2.7969, 2.0156, 1.1641, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1875, -5.6875, -1.3359, 2.1562, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8438, 0.9883, 3.4219, -1.2422, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2031, 0.7695, 1.8359, -3.1250, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.4375, -2.9844, -2.0312, 1.2734, -0.3809]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') [2025-11-06 19:03:40,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 19:03:40,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.59 | bwd_microstep: 939.04 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 937.86 | step_microstep: 1.86 [2025-11-06 19:03:40,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 285.50 | bwd: 940.01 | bwd_inner: 1.99 | bwd_allreduce: 937.89 | step: 1.94 91%|█████████ | 
3180/3507 [1:18:54<08:19, 1.53s/it] {'loss': 0.5242, 'learning_rate': 4.5273848808396027e-07, 'epoch': 0.91} 91%|█████████ | 3180/3507 [1:18:54<08:19, 1.53s/it]tensor([[-2.9219, -0.3457, 1.7422, -0.0889, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9688, -4.2812, 0.0493, 2.9531, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3438, -0.6914, 3.6406, 0.5000, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2812, -4.4062, -0.2598, 2.0000, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:03:40,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.01 | bwd_microstep: 5.77 | bwd_inner_microstep: 5.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-0.9648, 2.1562, 2.2188, -1.5781, -1.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5625, -1.5078, 2.4219, 0.2490, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0938, -4.4375, 1.6172, 1.4375, -5.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6875, -2.6094, 1.6484, 1.7812, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:03:41,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:03:41,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.59 | bwd_microstep: 1.79 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.06 [2025-11-06 19:03:41,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.62 | bwd: 7.56 | bwd_inner: 6.54 | bwd_allreduce: 0.86 | step: 2.14 91%|█████████ | 3181/3507 [1:18:55<07:18, 1.34s/it] 
{'loss': 0.6213, 'learning_rate': 4.499946048106085e-07, 'epoch': 0.91} 91%|█████████ | 3181/3507 [1:18:55<07:18, 1.34s/it]tensor([[-5.9688, -3.2188, 1.0547, 0.0728, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2500, -3.1250, 0.0197, 1.4922, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:03:41,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.52 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-0.4180, 1.9766, 2.0156, -0.4336, -0.8984]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5938, -4.3125, -0.0249, 2.0156, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, -0.6680, 3.0312, -0.3242, -4.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-0.2021, -1.0391, -0.6133, 2.9688, 1.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.5625, -3.5469, 0.1846, 2.0781, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.3125, -4.2812, 1.6562, 0.5625, -5.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:03:42,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 19:03:42,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.28 | bwd_microstep: 656.75 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 655.78 | step_microstep: 2.03 [2025-11-06 19:03:42,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.84 | bwd: 657.49 | bwd_inner: 1.50 | bwd_allreduce: 655.83 | step: 2.12 91%|█████████ | 3182/3507 [1:18:56<06:47, 1.25s/it] {'loss': 0.816, 
'learning_rate': 4.4725887026987325e-07, 'epoch': 0.91} 91%|█████████ | 3182/3507 [1:18:56<06:47, 1.25s/it]tensor([[-4.6250, -4.5938, -1.2812, 2.1719, -2.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6406, 0.5664, 3.5156, -1.5469, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.3984, -1.9219, -0.5117, 3.0781, 0.6016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3594, -0.5898, 3.2188, 1.6406, -2.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1875, -4.0312, 0.1289, 0.1826, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:03:44,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.47 | bwd_microstep: 4.62 | bwd_inner_microstep: 4.48 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-2.3906, 0.3477, 0.8281, -1.5859, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-5.3750, -3.4688, 0.4512, 0.4512, -3.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.0079, 2.5625, 3.5000, 1.3828, -0.3555]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:03:45,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.22 | optimizer_step: 0.30 [2025-11-06 19:03:45,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 196.39 | bwd_microstep: 1057.66 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 1056.76 | step_microstep: 2.49 [2025-11-06 19:03:45,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 383.87 | bwd: 1062.29 | bwd_inner: 5.28 | bwd_allreduce: 1056.84 | step: 2.59 91%|█████████ | 3183/3507 [1:18:59<10:20, 1.92s/it] {'loss': 0.4172, 'learning_rate': 
4.445312867960727e-07, 'epoch': 0.91} 91%|█████████ | 3183/3507 [1:18:59<10:20, 1.92s/it]tensor([[-3.4375, -4.0625, -2.1562, 2.0938, -0.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0000, -4.0312, -2.6562, 1.9141, -0.4023]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:03:46,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.47 | bwd_microstep: 8.17 | bwd_inner_microstep: 8.02 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-5.2188, -3.0000, 1.4297, 1.0547, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.4062, -2.3281, 2.3438, 0.4531, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5000, -2.7969, 3.0625, 0.4102, -5.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6562, -4.5312, -0.7617, 3.0938, -1.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.5000, -4.6250, 1.1094, 2.5312, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5312, -2.3594, 2.9062, -1.0234, -5.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:03:46,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 19:03:46,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.59 | bwd_microstep: 279.61 | bwd_inner_microstep: 6.41 | bwd_allreduce_microstep: 273.08 | step_microstep: 7.01 [2025-11-06 19:03:46,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.08 | bwd: 287.77 | bwd_inner: 14.45 | bwd_allreduce: 273.14 | step: 7.11 91%|█████████ | 3184/3507 [1:19:00<08:22, 1.55s/it] {'loss': 0.3887, 'learning_rate': 4.4181185671657634e-07, 
'epoch': 0.91} 91%|█████████ | 3184/3507 [1:19:00<08:22, 1.55s/it]tensor([[-3.7344, -0.0962, 2.7656, -1.0234, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[4.8750, 6.7188, 7.8125, 7.0938, 4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.3125, -4.0000, 0.5273, 0.2227, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.4688, 1.2578, 2.7031, -1.6172, -3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, -2.1094, 1.4766, 1.0938, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:03:47,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.22 | bwd_microstep: 1.32 | bwd_inner_microstep: 1.17 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-5.0312, -4.1875, -0.3125, 2.0000, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7031, 0.6016, 2.8750, -2.3594, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4062, -3.7188, 1.3594, 2.5000, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:03:50,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 19:03:50,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.86 | bwd_microstep: 3049.32 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 3048.37 | step_microstep: 2.16 [2025-11-06 19:03:50,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 342.10 | bwd: 3050.63 | bwd_inner: 2.04 | bwd_allreduce: 3048.42 | step: 2.26 91%|█████████ | 3185/3507 [1:19:04<12:23, 2.31s/it] {'loss': 0.3569, 'learning_rate': 4.391005823517891e-07, 'epoch': 0.91} 91%|█████████ 
| 3185/3507 [1:19:04<12:23, 2.31s/it]tensor([[-5.8438, -4.9375, -0.2402, 2.6250, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5938, -1.5469, 1.4922, 0.9570, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -0.0811, 3.6875, -0.6289, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:03:50,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.09 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2812, -5.0625, -3.5156, 0.7695, -1.4297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-5.9062, -1.4609, 2.1562, -2.7031, -5.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1875, -4.3750, -1.0547, 3.0938, -1.4141]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6562, -1.2422, 3.6250, -1.0625, -5.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.5938, -4.8438, -0.7930, 1.7188, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:03:51,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.23 | optimizer_step: 0.19 [2025-11-06 19:03:51,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.97 | bwd_microstep: 84.74 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 83.61 | step_microstep: 2.13 [2025-11-06 19:03:51,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.10 | bwd: 85.42 | bwd_inner: 1.61 | bwd_allreduce: 83.66 | step: 2.21 91%|█████████ | 3186/3507 [1:19:05<09:26, 1.76s/it] {'loss': 0.3868, 'learning_rate': 4.3639746601516044e-07, 'epoch': 0.91} 91%|█████████ | 3186/3507 [1:19:05<09:26, 
1.76s/it]tensor([[-6.7188, -3.6094, 2.3125, 0.8984, -5.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2812, -3.0000, 0.2500, -0.9727, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8438, -1.9922, 3.1719, -0.0160, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:03:51,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.51 | bwd_microstep: 11.18 | bwd_inner_microstep: 11.05 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-5.2188, -2.6719, 1.8594, 1.3203, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.1250, -3.6719, 1.2188, 0.8320, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8438, 0.6992, 3.1719, -0.3965, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.6562, 1.7500, 4.1250, -1.7812, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.0000, -5.6562, 0.4805, 0.9492, -5.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:03:52,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.83 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 19:03:52,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.12 | bwd_microstep: 1269.81 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 1268.92 | step_microstep: 2.66 [2025-11-06 19:03:52,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.65 | bwd: 1280.99 | bwd_inner: 11.85 | bwd_allreduce: 1268.99 | step: 2.76 91%|█████████ | 3187/3507 [1:19:06<09:22, 1.76s/it] {'loss': 0.8333, 'learning_rate': 4.337025100131764e-07, 'epoch': 0.91} 91%|█████████ | 3187/3507 [1:19:06<09:22, 
1.76s/it]tensor([[-5.5625, -1.5391, 2.6406, -1.7266, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.3750, -5.4062, -2.2344, 1.6250, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:03:53,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.80 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.8125, -3.8750, 0.3438, 0.9648, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5625, -4.6875, -0.3848, 2.0469, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0938, -3.5469, 1.0000, 2.2812, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0000, -3.7031, 1.8594, 0.1846, -5.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1875, -3.8125, 0.4609, 1.6250, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9375, -2.0312, 2.9375, 1.5859, -3.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:03:54,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 19:03:54,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.35 | bwd_microstep: 688.95 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 688.02 | step_microstep: 2.08 [2025-11-06 19:03:54,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 366.18 | bwd: 689.71 | bwd_inner: 1.49 | bwd_allreduce: 688.06 | step: 2.16 91%|█████████ | 3188/3507 [1:19:07<08:17, 1.56s/it] {'loss': 1.1801, 'learning_rate': 4.3101571664536433e-07, 'epoch': 0.91} 91%|█████████ | 3188/3507 [1:19:07<08:17, 1.56s/it]tensor([[-1.7500, 
1.4531, 2.5781, -0.5898, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.3750, -3.5156, 1.0938, 1.7109, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:03:54,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.08 | bwd_microstep: 5.46 | bwd_inner_microstep: 5.25 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14 tensor([[-6.5625, -4.0625, 1.8594, 1.6094, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.7812, 1.2734, 1.4609, -0.6680, -1.1016]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.2031, 0.7344, 3.2812, -1.6641, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.8906, -0.2090, 3.0625, 5.4375, 0.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0000, -0.7812, 3.7656, -0.8008, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.8750, -9.0000, -4.0312, 1.2266, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:03:56,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.21 | optimizer_step: 0.28 [2025-11-06 19:03:56,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.86 | bwd_microstep: 2146.80 | bwd_inner_microstep: 5.55 | bwd_allreduce_microstep: 2141.13 | step_microstep: 2.52 [2025-11-06 19:03:56,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 235.97 | bwd: 2152.26 | bwd_inner: 10.85 | bwd_allreduce: 2141.21 | step: 2.67 91%|█████████ | 3189/3507 [1:19:10<09:39, 1.82s/it] {'loss': 0.5384, 'learning_rate': 4.2833708820428366e-07, 'epoch': 0.91} 91%|█████████ | 3189/3507 [1:19:10<09:39, 1.82s/it]tensor([[-3.0469, -1.5234, 1.1016, 1.4609, 
-1.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:03:56,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.19 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-7.0625, -5.5938, -0.9922, 0.4414, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[3.6406, 5.4062, 5.8125, 4.3750, 2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.4062, -5.1250, -2.5781, 2.0625, -1.4609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7344, -4.0000, -1.3203, 2.6250, -1.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0625, -4.5000, 0.2295, 1.7031, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.2188, -4.3125, 1.2734, 2.3594, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9688, -0.7031, 2.8438, -1.9766, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:03:56,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 19:03:56,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.04 | bwd_microstep: 123.48 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 122.43 | step_microstep: 1.63 [2025-11-06 19:03:56,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 305.25 | bwd: 124.50 | bwd_inner: 1.87 | bwd_allreduce: 122.48 | step: 1.72 91%|█████████ | 3190/3507 [1:19:10<07:28, 1.42s/it] {'loss': 0.3206, 'learning_rate': 4.256666269755283e-07, 'epoch': 0.91} 91%|█████████ | 3190/3507 [1:19:10<07:28, 1.42s/it]tensor([[-3.4375, -4.3750, -2.6562, 1.8516, -0.7891]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7812, -4.9375, -0.4922, 2.4531, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2656, -3.7812, -0.6797, 3.9688, -0.5195]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:03:57,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.33 | bwd_microstep: 1.13 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2500, -3.4219, 0.1436, 2.3281, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.7188, -4.1250, 0.5234, 1.9609, -3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4062, -3.4375, 0.8906, 1.2188, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.7500, -3.5469, -1.2656, 3.4375, -0.1504]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4062, -5.0938, -2.1562, 2.6406, -1.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:04:00,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.16 | optimizer_step: 0.15 [2025-11-06 19:04:00,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.10 | bwd_microstep: 2840.78 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 2839.57 | step_microstep: 1.98 [2025-11-06 19:04:00,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.45 | bwd: 2841.92 | bwd_inner: 2.18 | bwd_allreduce: 2839.61 | step: 2.06 91%|█████████ | 3191/3507 [1:19:13<10:19, 1.96s/it] {'loss': 0.1815, 'learning_rate': 4.230043352377222e-07, 'epoch': 0.91} 91%|█████████ | 3191/3507 [1:19:13<10:19, 1.96s/it]tensor([[-4.2812, -2.1406, 1.0547, 0.7305, -3.0469]], device='cuda:1', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6875, -1.4375, 1.8438, -0.8594, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:00,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.50 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.0312, -4.9062, -0.0510, 2.4219, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -4.2812, 0.4766, 2.6562, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6875, -4.5312, -1.1875, 2.1094, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8125, -0.2451, 2.9375, -0.9453, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1562, -4.3125, 0.6523, 1.6250, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.7812, -2.8281, 3.2969, 0.1416, -5.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:00,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:04:00,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.10 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 0.72 | step_microstep: 1.81 [2025-11-06 19:04:00,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 459.63 | bwd: 2.94 | bwd_inner: 2.06 | bwd_allreduce: 0.76 | step: 1.90 91%|█████████ | 3192/3507 [1:19:14<07:59, 1.52s/it] {'loss': 0.1703, 'learning_rate': 4.203502152625172e-07, 'epoch': 0.91} 91%|█████████ | 3192/3507 [1:19:14<07:59, 1.52s/it]tensor([[-4.8438, -4.6875, -1.3750, 2.1094, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:1') tensor([[-6.1562, -5.2188, -0.7891, 1.7109, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5625, -4.6875, -1.1172, 0.9688, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:04:00,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.09 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.3594, -0.3164, 2.5625, 0.2197, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.0625, -4.5625, 1.1328, 1.0703, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4688, -0.9922, 3.1562, -0.0096, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6562, 0.6719, 3.9062, -1.3906, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8438, -2.9219, 1.4141, 2.1406, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:04:01,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 19:04:01,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.70 | bwd_microstep: 459.20 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 458.16 | step_microstep: 1.89 [2025-11-06 19:04:01,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 420.83 | bwd: 460.10 | bwd_inner: 1.78 | bwd_allreduce: 458.19 | step: 1.96 91%|█████████ | 3193/3507 [1:19:15<07:01, 1.34s/it] {'loss': 0.5808, 'learning_rate': 4.1770426931459605e-07, 'epoch': 0.91} 91%|█████████ | 3193/3507 [1:19:15<07:01, 1.34s/it]tensor([[-4.7812, -2.5625, 1.1797, 0.9258, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') 
[Training log, condensed: steps 3194-3213 of 3507, epoch 0.91 -> 0.92. The raw console stream interleaved per-rank debug prints of bfloat16 logits and integer labels (tensor(...) pairs on cuda:0-cuda:3, grad_fn names truncated) with DeepSpeed [Rank 0] microstep timings; only the per-step metrics are tabulated below. In the timing summaries, backward time was dominated by bwd_allreduce, which varied between roughly 1.5 ms and 4074 ms across these steps, while bwd_inner stayed under ~10 ms.]

step 3194  loss 0.4924  lr 4.1506649965166403e-07  epoch 0.91  1.10 s/it
step 3195  loss 0.1271  lr 4.1243690852445174e-07  epoch 0.91  1.69 s/it
step 3196  loss 0.4351  lr 4.0981549817670883e-07  epoch 0.91  1.33 s/it
step 3197  loss 0.5773  lr 4.0720227084520613e-07  epoch 0.91  1.79 s/it
step 3198  loss 0.0412  lr 4.045972287597333e-07   epoch 0.91  1.38 s/it
step 3199  loss 0.5143  lr 4.0200037414309225e-07  epoch 0.91  1.47 s/it
step 3200  loss 0.4758  lr 3.9941170921110386e-07  epoch 0.91  1.15 s/it
step 3201  loss 0.5903  lr 3.968312361725968e-07   epoch 0.91  1.37 s/it
step 3202  loss 0.2327  lr 3.94258957229412e-07    epoch 0.91  1.19 s/it
step 3203  loss 0.3603  lr 3.916948745763938e-07   epoch 0.91  1.58 s/it
step 3204  loss 0.3749  lr 3.89138990401402e-07    epoch 0.91  1.49 s/it
step 3205  loss 0.4324  lr 3.865913068852933e-07   epoch 0.91  1.63 s/it
step 3206  loss 0.5024  lr 3.840518262019299e-07   epoch 0.91  1.57 s/it
step 3207  loss 0.3577  lr 3.815205505181741e-07   epoch 0.91  1.82 s/it
step 3208  loss 0.2331  lr 3.789974819938869e-07   epoch 0.91  2.61 s/it
step 3209  loss 0.3507  lr 3.764826227819285e-07   epoch 0.92  1.95 s/it
step 3210  loss 0.4767  lr 3.7397597502815133e-07  epoch 0.92  1.52 s/it
step 3211  loss 0.5395  lr 3.714775408714033e-07   epoch 0.92  1.26 s/it
step 3212  loss 0.3307  lr 3.6898732244352143e-07  epoch 0.92  1.64 s/it
step 3213  loss 0.3552  lr 3.6650532186933817e-07  epoch 0.92  1.52 s/it
2.5938, 1.1250, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7812, -4.6250, -0.2354, 2.0938, -3.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1250, -2.0469, 1.8125, 1.9297, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4375, -3.5938, 0.6602, 3.2812, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5000, -5.1250, -0.6797, 1.3906, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5625, -5.2812, -0.9062, 0.6953, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:04:35,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.19 | optimizer_step: 0.21 [2025-11-06 19:04:35,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.35 | bwd_microstep: 2024.39 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 2023.24 | step_microstep: 2.11 [2025-11-06 19:04:35,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.24 | bwd: 2025.39 | bwd_inner: 1.91 | bwd_allreduce: 2023.30 | step: 2.22 92%|█████████▏| 3214/3507 [1:19:49<08:43, 1.79s/it] {'loss': 0.2963, 'learning_rate': 3.640315412666662e-07, 'epoch': 0.92} 92%|█████████▏| 3214/3507 [1:19:49<08:43, 1.79s/it]tensor([[-3.4375, 0.3457, 1.9609, -2.6719, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1250, -3.6250, 1.0703, 2.3594, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:04:35,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.76 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[ 0.2637, 3.7031, 2.8906, -1.5078, 
-1.0547]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.7188, -3.6719, 1.6953, -0.1953, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.7500, -5.2500, 0.5234, 2.6250, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.6562, -2.2812, -1.1406, 2.4844, 0.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4688, -1.6016, 3.6406, 0.1001, -4.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8750, -3.4219, 1.2500, 0.7188, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:04:36,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 19:04:36,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 75.76 | bwd_microstep: 943.10 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 941.99 | step_microstep: 34.33 [2025-11-06 19:04:36,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 277.55 | bwd: 944.20 | bwd_inner: 2.00 | bwd_allreduce: 942.04 | step: 34.43 92%|█████████▏| 3215/3507 [1:19:50<07:58, 1.64s/it] {'loss': 0.5494, 'learning_rate': 3.6156598274630915e-07, 'epoch': 0.92} 92%|█████████▏| 3215/3507 [1:19:50<07:58, 1.64s/it]tensor([[-3.9688, 0.5547, 3.8125, -1.5312, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:36,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.02 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.4688, -2.9688, -1.3203, 2.6406, -0.1011]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1875, -1.8672, 2.8594, 0.7109, -4.3125]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5312, 0.4512, 4.2188, -0.2275, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.3750, -5.1562, 0.4316, 0.8008, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9375, -5.0312, -0.5352, 2.2656, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.3594, -3.8906, -0.8633, 3.7969, -0.5742]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0625, -3.1719, 1.8203, 0.6953, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:37,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.20 | optimizer_step: 0.30 [2025-11-06 19:04:37,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.97 | bwd_microstep: 2.71 | bwd_inner_microstep: 1.44 | bwd_allreduce_microstep: 1.16 | step_microstep: 2.35 [2025-11-06 19:04:37,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.00 | bwd: 3.56 | bwd_inner: 2.20 | bwd_allreduce: 1.20 | step: 2.43 92%|█████████▏| 3216/3507 [1:19:51<06:11, 1.28s/it] {'loss': 0.4979, 'learning_rate': 3.591086484120543e-07, 'epoch': 0.92} 92%|█████████▏| 3216/3507 [1:19:51<06:11, 1.28s/it]tensor([[-2.8594, -3.4531, -1.8672, 2.2969, -0.3809]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2188, -3.8125, 0.2676, 1.7422, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:37,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.30 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-1.1094, 1.6797, 3.4688, 1.4141, -1.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:0') tensor([[-4.8750, -0.2061, 4.0000, -2.0000, -5.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5625, -2.8281, 2.9531, 0.1621, -5.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6562, -0.5703, 3.2656, -0.8789, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.8750, -5.1250, 0.4121, 1.8828, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5000, -5.0938, -1.0469, 2.2656, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:04:39,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.21 | optimizer_step: 0.22 [2025-11-06 19:04:39,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.69 | bwd_microstep: 2327.61 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 2326.46 | step_microstep: 3.44 [2025-11-06 19:04:39,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.00 | bwd: 2328.46 | bwd_inner: 1.75 | bwd_allreduce: 2326.52 | step: 3.53 92%|█████████▏| 3217/3507 [1:19:53<08:13, 1.70s/it] {'loss': 0.3025, 'learning_rate': 3.5665954036067207e-07, 'epoch': 0.92} 92%|█████████▏| 3217/3507 [1:19:53<08:13, 1.70s/it]tensor([[-3.6406, -3.0312, 0.0051, 2.4062, -1.6641]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5000, -5.0625, -1.1250, 2.3125, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:04:40,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.38 | bwd_microstep: 5.67 | bwd_inner_microstep: 5.45 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.19 tensor([[-9.1875, -7.1250, -0.7109, 0.4336, -6.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:3') tensor([[-5.6250, -1.9453, 2.4375, -0.7109, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4375, -3.0625, 1.1719, 2.5312, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7812, -4.8438, -2.5781, 0.7031, -2.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8750, -4.6875, 0.4688, 3.0000, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7031, 1.0234, 4.0000, -2.4844, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:04:40,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.24 | optimizer_step: 0.33 [2025-11-06 19:04:40,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.93 | bwd_microstep: 352.35 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 351.32 | step_microstep: 2.59 [2025-11-06 19:04:40,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 384.35 | bwd: 358.02 | bwd_inner: 6.39 | bwd_allreduce: 351.45 | step: 2.79 92%|█████████▏| 3218/3507 [1:19:54<06:53, 1.43s/it] {'loss': 0.1178, 'learning_rate': 3.5421866068191315e-07, 'epoch': 0.92} 92%|█████████▏| 3218/3507 [1:19:54<06:53, 1.43s/it]tensor([[-6.5625, -4.4688, 0.5859, 1.2734, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:40,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.95 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.2812, -4.0312, -0.2930, 3.1250, -1.7422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8281, -3.3750, 0.4258, 3.6406, -1.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-3.5312, -2.2188, 1.8984, 3.5625, -1.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-8.1250, -6.0625, 0.4375, 1.7422, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1562, -0.9844, 3.0625, -1.3594, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8438, -1.5312, 1.0078, -2.0781, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1562, -5.4375, -0.8555, 2.1094, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:04:44,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 19:04:44,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.35 | bwd_microstep: 3573.53 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 3572.60 | step_microstep: 2.08 [2025-11-06 19:04:44,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.33 | bwd: 3574.40 | bwd_inner: 1.60 | bwd_allreduce: 3572.65 | step: 2.17 92%|█████████▏| 3219/3507 [1:19:58<10:32, 2.20s/it] {'loss': 0.4353, 'learning_rate': 3.517860114585037e-07, 'epoch': 0.92} 92%|█████████▏| 3219/3507 [1:19:58<10:32, 2.20s/it]tensor([[-6.0000, -5.7188, -2.0312, 1.5469, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5625, -3.7344, 0.7031, -0.4883, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7500, -5.1250, -0.8008, 2.3594, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:04:44,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.87 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.8750, 
-5.6875, -1.2812, 2.8750, -2.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.5312, -5.5625, 0.3457, 1.6172, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -1.4062, 1.5859, -1.9766, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.0000, -4.4375, -0.4766, 0.6445, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7188, 0.1787, 3.1094, -1.1094, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:04:45,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.21 [2025-11-06 19:04:45,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.68 | bwd_microstep: 120.55 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 119.61 | step_microstep: 2.09 [2025-11-06 19:04:45,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 298.57 | bwd: 121.34 | bwd_inner: 1.55 | bwd_allreduce: 119.66 | step: 2.16 92%|█████████▏| 3220/3507 [1:19:58<08:00, 1.68s/it] {'loss': 0.274, 'learning_rate': 3.4936159476615216e-07, 'epoch': 0.92} 92%|█████████▏| 3220/3507 [1:19:58<08:00, 1.68s/it]tensor([[-3.3281, -1.3359, 1.6016, 1.2891, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.3438, -4.3750, 1.6719, 0.7695, -5.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:45,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.45 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-5.5938, -4.2812, 0.3789, 2.2812, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8438, -2.4375, 1.2578, 0.3242, 
-3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.6562, -4.9688, -0.0703, 0.9648, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6094, -2.9688, -1.7812, 1.4688, -0.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-6.3438, -2.5469, 2.2812, -1.1172, -5.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.0938, -2.4844, 2.3438, -0.3789, -5.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:04:47,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:04:47,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.84 | bwd_microstep: 2003.35 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 2002.29 | step_microstep: 2.21 [2025-11-06 19:04:47,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.29 | bwd: 2004.37 | bwd_inner: 1.85 | bwd_allreduce: 2002.36 | step: 2.31 92%|█████████▏| 3221/3507 [1:20:01<08:59, 1.88s/it] {'loss': 0.6204, 'learning_rate': 3.4694541267354165e-07, 'epoch': 0.92} 92%|█████████▏| 3221/3507 [1:20:01<08:59, 1.88s/it]tensor([[-5.5000, -1.3984, 2.7344, -1.1094, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:47,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.23 | bwd_microstep: 1.23 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.2812, -3.8750, -2.1562, 2.0938, -0.6523]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.9062, -1.5000, 2.9062, -0.1104, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2500, -0.4102, 4.0625, -1.7656, -5.5938]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5000, -4.3125, 1.0391, 1.5156, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1250, -5.0312, -1.2891, 2.7500, -2.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7500, -2.9375, 2.0938, 3.0469, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.5078, 2.4062, 1.9609, -1.2891, -1.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:04:48,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 19:04:48,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.44 | bwd_microstep: 140.97 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 140.17 | step_microstep: 2.26 [2025-11-06 19:04:48,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.69 | bwd: 142.20 | bwd_inner: 1.82 | bwd_allreduce: 140.22 | step: 2.36 92%|█████████▏| 3222/3507 [1:20:01<07:00, 1.48s/it] {'loss': 0.6575, 'learning_rate': 3.445374672423252e-07, 'epoch': 0.92} 92%|█████████▏| 3222/3507 [1:20:01<07:00, 1.48s/it]tensor([[-4.4688, -0.1768, 3.2188, -1.4609, -4.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.8438, -4.2812, 1.5078, 1.3906, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8281, 0.4258, 1.7734, -1.5156, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6562, -1.3984, 2.9375, 0.8086, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:48,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.40 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.60 | 
bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.1875, 0.5859, 2.6250, -1.5781, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.4375, -3.2812, 2.2656, 0.5508, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, 0.2461, 3.5781, -1.5938, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.7969, 0.4590, 2.5156, -0.5273, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:04:50,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.31 | optimizer_step: 0.44 [2025-11-06 19:04:50,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 303.87 | bwd_microstep: 1990.02 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 1988.98 | step_microstep: 3.52 [2025-11-06 19:04:50,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 455.29 | bwd: 1990.78 | bwd_inner: 1.55 | bwd_allreduce: 1989.06 | step: 3.62 92%|█████████▏| 3223/3507 [1:20:04<08:25, 1.78s/it] {'loss': 0.215, 'learning_rate': 3.421377605271325e-07, 'epoch': 0.92} 92%|█████████▏| 3223/3507 [1:20:04<08:25, 1.78s/it]tensor([[-4.7188, -4.7500, -0.8672, 3.0625, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.5469, -2.3750, -1.1797, 2.8750, 0.6172]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-9.5625, -6.8438, -1.3594, -1.8750, -7.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:50,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.32 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.2188, -4.4062, 0.0708, 2.9219, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') tensor([[-3.7344, -3.9219, -1.5078, 2.0000, -1.2891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.8438, -3.7812, -1.9922, 2.8438, -0.1099]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.9062, -4.7500, 0.1865, 0.4785, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.3750, 0.6133, 3.1875, -1.7578, -3.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 19:04:50,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 19:04:50,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.95 | bwd_microstep: 1.79 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.77 | step_microstep: 1.50 [2025-11-06 19:04:50,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.29 | bwd: 2.70 | bwd_inner: 1.76 | bwd_allreduce: 0.81 | step: 1.58 92%|█████████▏| 3224/3507 [1:20:04<06:26, 1.37s/it] {'loss': 1.1108, 'learning_rate': 3.397462945755603e-07, 'epoch': 0.92} 92%|█████████▏| 3224/3507 [1:20:04<06:26, 1.37s/it]tensor([[-4.9062, -2.2812, 1.8516, 0.5195, -3.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:51,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.20 | bwd_microstep: 3.88 | bwd_inner_microstep: 3.74 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-6.0938, -5.0625, -0.3340, 2.1406, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5312, -0.9844, 1.9375, -1.4531, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3906, 1.2891, 3.3125, -2.9531, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') 
tensor([[-6.4688, -3.5469, 2.0312, 1.1562, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -3.2969, 0.4648, 1.8984, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.1562, -3.5469, -0.0674, 2.8906, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8750, 0.8086, 1.8828, -2.5312, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 19:04:52,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 19:04:52,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.64 | bwd_microstep: 1488.54 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 1487.53 | step_microstep: 2.05 [2025-11-06 19:04:52,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.86 | bwd: 1492.42 | bwd_inner: 4.68 | bwd_allreduce: 1487.59 | step: 2.15 92%|█████████▏| 3225/3507 [1:20:06<07:06, 1.51s/it] {'loss': 0.5021, 'learning_rate': 3.373630714281739e-07, 'epoch': 0.92} 92%|█████████▏| 3225/3507 [1:20:06<07:06, 1.51s/it]tensor([[-7.4688, -6.0625, -0.4082, 2.1406, -4.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9688, -2.8750, 2.6250, 1.2422, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2500, -2.5625, 1.6719, 2.6094, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:04:52,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.46 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.2969, -0.6680, 2.6094, 0.5898, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5938, -2.9688, 
1.1797, 0.2559, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2812, -3.7969, -2.3125, 1.3047, -0.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5625, -4.2812, -0.1348, 1.8906, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4375, -2.4688, 1.7656, 0.0315, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:04:53,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 19:04:53,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.97 | bwd_microstep: 116.29 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 115.10 | step_microstep: 1.75 [2025-11-06 19:04:53,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.45 | bwd: 117.30 | bwd_inner: 2.01 | bwd_allreduce: 115.15 | step: 1.84 92%|█████████▏| 3226/3507 [1:20:07<05:38, 1.21s/it] {'loss': 0.2015, 'learning_rate': 3.3498809311850677e-07, 'epoch': 0.92} 92%|█████████▏| 3226/3507 [1:20:07<05:38, 1.21s/it]tensor([[-3.4375, -0.7461, 2.2969, 0.5117, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:53,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.50 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.2500, -5.5000, -1.4688, 1.2812, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.8438, -5.2500, -0.3770, 1.0781, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5625, -3.7656, -0.5898, 1.4141, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2656, -4.1875, -2.5156, 2.0781, 
-0.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5781, 0.2344, 2.1250, -0.7578, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.7344, -4.3750, -2.3906, 1.9922, -0.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3125, -0.7539, 1.9297, -3.8125, -5.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:04:56,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.78 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 19:04:56,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.77 | bwd_microstep: 3340.78 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 3339.88 | step_microstep: 2.92 [2025-11-06 19:04:56,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.30 | bwd: 3341.62 | bwd_inner: 1.50 | bwd_allreduce: 3339.94 | step: 3.02 92%|█████████▏| 3227/3507 [1:20:10<09:08, 1.96s/it] {'loss': 0.1329, 'learning_rate': 3.326213616730578e-07, 'epoch': 0.92} 92%|█████████▏| 3227/3507 [1:20:10<09:08, 1.96s/it]tensor([[-5.5625, -4.1562, 0.6562, 2.2969, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5625, -4.8750, -0.3770, 2.9375, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6250, 0.4258, 3.0469, 0.6211, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8906, -4.6250, -2.4219, 2.0469, -1.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7031, -1.1641, 2.2656, 0.5781, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:04:57,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 258.84 | bwd_microstep: 1.34 | 
bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09
tensor([[-4.0938, -2.8906, 1.4141, 3.5000, -2.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.9062, -4.4375, -0.8281, 2.1562, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-2.3594, 1.1875, 2.8125, -1.2891, -2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 19:04:57,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 19:04:57,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.58 | bwd_microstep: 26.62 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 25.41 | step_microstep: 2.19
[2025-11-06 19:04:57,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 408.45 | bwd: 27.96 | bwd_inner: 2.35 | bwd_allreduce: 25.46 | step: 2.28
92%|█████████▏| 3228/3507 [1:20:11<07:04, 1.52s/it] {'loss': 0.1278, 'learning_rate': 3.3026287911128383e-07, 'epoch': 0.92}
tensor([[-4.1875, -3.4062, 0.2754, 2.6875, -1.9766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:04:57,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.49 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-3.6875, 1.0078, 4.1562, -2.2031, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.3750, -2.4531, 1.2969, -0.4727, -4.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.4375, -5.5625, -1.6562, 1.2578, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-1.0312, -0.7422, 2.2969, 5.3125, 0.7539]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.0000, -5.2188, -0.7227, 2.2031, -3.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.5000, -3.1562, 1.6875, 1.6953, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.3125, -0.0269, 2.9375, 0.1143, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 19:04:58,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 19:04:58,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.58 | bwd_microstep: 371.37 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 370.36 | step_microstep: 1.97
[2025-11-06 19:04:58,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.09 | bwd: 372.23 | bwd_inner: 1.70 | bwd_allreduce: 370.40 | step: 2.04
92%|█████████▏| 3229/3507 [1:20:11<05:54, 1.28s/it] {'loss': 0.1643, 'learning_rate': 3.279126474456129e-07, 'epoch': 0.92}
tensor([[-2.8125, 1.2500, 2.9844, -2.0312, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-4.0938, -4.8125, -2.5000, 2.2344, -1.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:2')
tensor([[-4.4688, -4.2188, -0.4746, 3.1250, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:04:58,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.65 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11
tensor([[-2.5000, 1.3047, 3.4531, -0.9766, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-2.7812, 0.2852, 2.4531, -0.5234, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.9375, -6.7500, -2.2031, 1.9141, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-3.3906, 1.0938, 3.5000, -2.6875, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.7188, -2.4219, -0.9805, -4.2500, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
[2025-11-06 19:04:59,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.14 | optimizer_step: 0.18
[2025-11-06 19:04:59,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.87 | bwd_microstep: 582.93 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 581.84 | step_microstep: 1.57
[2025-11-06 19:04:59,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 370.56 | bwd: 583.93 | bwd_inner: 1.89 | bwd_allreduce: 581.90 | step: 1.68
92%|█████████▏| 3230/3507 [1:20:12<05:30, 1.19s/it] {'loss': 0.72, 'learning_rate': 3.25570668681422e-07, 'epoch': 0.92}
tensor([[-5.7188, -4.1875, 0.1934, 1.5625, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:04:59,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.67 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.12
tensor([[-5.9688, -4.0000, 0.6680, 1.3906, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.3750, -4.6875, -1.5781, 2.8906, -1.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.8750, -4.2812, -0.6250, 2.1406, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.4062, -1.2500, 2.7812, 0.6172, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-8.5000, -6.6875, -0.1553, 1.8203, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-7.5312, -6.9062, -2.2500, 1.3203, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.9375, -6.0938, -2.5625, 1.5781, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 19:05:00,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.87 | optimizer_gradients: 0.22 | optimizer_step: 0.31
[2025-11-06 19:05:00,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.41 | bwd_microstep: 783.92 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 782.82 | step_microstep: 4.06
[2025-11-06 19:05:00,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 340.11 | bwd: 785.03 | bwd_inner: 1.99 | bwd_allreduce: 782.89 | step: 4.18
92%|█████████▏| 3231/3507 [1:20:14<05:26, 1.18s/it] {'loss': 0.1316, 'learning_rate': 3.232369448170525e-07, 'epoch': 0.92}
tensor([[-6.9062, -4.3750, 0.5195, 0.0674, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.2812, -0.6211, 3.6719, -1.7500, -5.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-6.5000, -3.0469, 2.0000, -0.5312, -5.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:05:01,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 148.42 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-2.6250, -2.6719, 0.3672, 4.2188, -0.3398]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.3750, -2.6562, 1.2422, -0.3965, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-6.1250, -2.5000, 2.4375, -0.4805, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.1562, -2.4688, 0.3516, 0.5547, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.4062, -4.5000, -1.0703, 3.0312, -1.6172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
[2025-11-06 19:05:02,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.29 | optimizer_step: 0.31
[2025-11-06 19:05:02,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.06 | bwd_microstep: 601.84 | bwd_inner_microstep: 1.94 | bwd_allreduce_microstep: 599.73 | step_microstep: 2.77
[2025-11-06 19:05:02,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 343.47 | bwd: 602.81 | bwd_inner: 2.85 | bwd_allreduce: 599.78 | step: 2.85
92%|█████████▏| 3232/3507 [1:20:16<06:21, 1.39s/it] {'loss': 0.2526, 'learning_rate': 3.209114778438027e-07, 'epoch': 0.92}
tensor([[-5.0312, -3.7812, 0.2080, 1.8125, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:02,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.92 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-7.2812, -6.2500, -1.3984, 1.1641, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.7500, -4.1250, -0.8203, 3.6094, -0.9727]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-5.0000, -3.3906, 0.8984, 2.2031, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.2812, -2.9844, 1.2266, 1.1953, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-1.8281, -2.7656, -2.7031, 0.9336, 0.2598]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:1')
tensor([[-4.7812, -5.4375, -2.8594, 1.6953, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.8125, -3.4844, 0.7891, 2.7344, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 19:05:03,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.59 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 19:05:03,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.89 | bwd_microstep: 1002.35 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1001.24 | step_microstep: 4.52
[2025-11-06 19:05:03,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 380.83 | bwd: 1003.26 | bwd_inner: 1.84 | bwd_allreduce: 1001.27 | step: 4.59
92%|█████████▏| 3233/3507 [1:20:17<06:22, 1.40s/it] {'loss': 0.3257, 'learning_rate': 3.1859426974592323e-07, 'epoch': 0.92}
tensor([[-7.1875, -5.4375, -0.0610, 1.1953, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.4375, -5.7500, -1.2812, 1.7500, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.7500, -1.0859, 3.2969, 0.1670, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.2500, -2.4062, 2.1875, 1.0312, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-7.2500, -4.5938, -0.0801, -1.0469, -5.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-4.9375, -1.6016, 3.1875, 0.7969, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:05:03,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 329.37 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-3.4219, -1.3984, 1.5781, 1.4141, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-6.4688, -3.8594, 2.2812, 2.1094, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
[2025-11-06 19:05:05,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.27 | optimizer_step: 0.31
[2025-11-06 19:05:05,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 96.37 | bwd_microstep: 1240.65 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1239.43 | step_microstep: 2.82
[2025-11-06 19:05:05,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 425.76 | bwd: 1241.64 | bwd_inner: 1.98 | bwd_allreduce: 1239.49 | step: 2.92
92%|█████████▏| 3234/3507 [1:20:19<06:47, 1.49s/it] {'loss': 0.2931, 'learning_rate': 3.162853225006168e-07, 'epoch': 0.92}
tensor([[-5.0938, -0.9180, 2.8281, -1.5781, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.1094, -2.9062, -2.3750, 1.1328, -0.0386]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:05,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.74 | bwd_microstep: 1.61 | bwd_inner_microstep: 1.45 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.08
tensor([[-3.6875, -0.1245, 2.5781, -1.1484, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.6875, -2.5781, 0.3965, 1.8672, -2.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.6875, 1.0000, 3.5938, -0.9141, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-5.7500, -3.5938, 0.4902, 0.5664, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.3125, -4.3438, 0.4727, 1.3125, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.9375, -3.6562, 2.0938, 0.2217, -5.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:05:05,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.72 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 19:05:05,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 281.15 | bwd_microstep: 2.27 | bwd_inner_microstep: 1.38 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.61
[2025-11-06 19:05:05,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 420.92 | bwd: 3.87 | bwd_inner: 2.84 | bwd_allreduce: 0.87 | step: 2.69
92%|█████████▏| 3235/3507 [1:20:19<05:22, 1.19s/it] {'loss': 0.2548, 'learning_rate': 3.139846380780387e-07, 'epoch': 0.92}
tensor([[-5.9688, -1.7500, 3.7188, -0.0967, -5.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.5938, -4.1875, 1.6875, 1.9922, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-7.9688, -5.8750, -0.0210, 0.8125, -5.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.7188, -0.4824, 2.6094, -0.7734, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.5938, -2.1406, 2.2031, 1.7422, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-7.6875, -3.3750, 1.5781, -2.8125, -7.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-7.0625, -5.2812, 0.5000, 1.9219, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:06,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.59 | bwd_microstep: 1.51 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.20 | step_microstep: 0.30
tensor([[-6.5000, -3.2031, 2.5469, 0.8008, -5.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:05:07,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.74 | optimizer_gradients: 0.21 | optimizer_step: 0.21
[2025-11-06 19:05:07,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 332.67 | bwd_microstep: 1.84 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.79 | step_microstep: 203.12
[2025-11-06 19:05:07,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 479.29 | bwd: 3.36 | bwd_inner: 1.99 | bwd_allreduce: 1.01 | step: 203.42
92%|█████████▏| 3236/3507 [1:20:21<05:50, 1.29s/it] {'loss': 0.8259, 'learning_rate': 3.1169221844129517e-07, 'epoch': 0.92}
tensor([[-2.0625, 1.7891, 2.4375, -2.4688, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:1')
tensor([[-2.8906, -3.3438, -1.3828, 2.2812, -0.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:0')
[2025-11-06 19:05:07,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.07 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.10
tensor([[-1.6641, 1.6797, 2.6406, -1.1016, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.5625, -2.3750, 3.5156, -0.3242, -5.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-1.7188, -2.3594, -1.5078, 2.1875, 0.3672]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.2500, -5.8750, -1.0938, 2.9531, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.0625, -0.9727, 1.5078, 0.7227, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.0625, -0.2617, 3.4219, 4.0312, -0.9180]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 19:05:09,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.59 | optimizer_gradients: 0.22 | optimizer_step: 0.30
[2025-11-06 19:05:09,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.81 | bwd_microstep: 1552.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 1552.00 | step_microstep: 4.34
[2025-11-06 19:05:09,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 293.88 | bwd: 1553.62 | bwd_inner: 1.42 | bwd_allreduce: 1552.06 | step: 4.44
92%|█████████▏| 3237/3507 [1:20:23<06:37, 1.47s/it] {'loss': 0.7719, 'learning_rate': 3.094080655464382e-07, 'epoch': 0.92}
tensor([[-5.4688, -5.0000, -0.7930, 2.5938, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.1875, -2.2500, 2.0625, 2.6875, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-6.3125, -2.1250, 3.0469, -1.1250, -5.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-6.9688, -4.8750, 0.2969, 1.0547, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:09,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.69 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.4375, -5.3438, -1.6406, 2.1719, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-5.8438, -5.0938, -0.5430, 2.6406, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.8750, -5.5938, -0.4746, 1.8516, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-7.9062, -4.7500, 1.5703, 0.3184, -6.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:05:09,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.11 | optimizer_gradients: 0.21 | optimizer_step: 0.20
[2025-11-06 19:05:09,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.11 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 0.94 | step_microstep: 3.98
[2025-11-06 19:05:09,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.82 | bwd: 2.99 | bwd_inner: 1.84 | bwd_allreduce: 0.99 | step: 4.08
92%|█████████▏| 3238/3507 [1:20:23<05:19, 1.19s/it] {'loss': 0.1641, 'learning_rate': 3.071321813424666e-07, 'epoch': 0.92}
tensor([[-3.5938, -3.5938, -1.2969, 1.8594, -1.3516]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-6.5938, -3.6250, 2.0625, 0.8281, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:09,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.31 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-6.5000, -5.5000, 0.1191, 3.3281, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.3438, -3.6562, -0.1289, 2.1250, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.2188, -4.3750, -0.0223, 2.4375, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-8.1875, -4.4375, 1.1328, -1.5781, -6.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.6875, -4.9062, -0.4902, 2.4219, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.7812, -4.2500, 0.1943, 1.6953, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
[2025-11-06 19:05:12,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 19:05:12,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.45 | bwd_microstep: 1841.99 | bwd_inner_microstep: 1.49 | bwd_allreduce_microstep: 1840.40 | step_microstep: 2.54
[2025-11-06 19:05:12,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.80 | bwd: 1842.66 | bwd_inner: 2.07 | bwd_allreduce: 1840.44 | step: 2.63
92%|█████████▏| 3239/3507 [1:20:25<06:45, 1.51s/it] {'loss': 0.4577, 'learning_rate': 3.0486456777132465e-07, 'epoch': 0.92}
tensor([[-4.3125, -0.1328, 3.9219, -0.6133, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.1562, -3.7812, 0.6680, 2.5156, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.6250, -4.2188, -0.1709, 2.9531, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.4375, -4.0000, 0.2217, 1.5000, -3.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.1250, -2.0938, 2.9062, -1.1016, -5.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.0625, 1.2344, 3.2656, -1.9141, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:05:13,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.20 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09
tensor([[-3.9844, -1.2891, 1.9688, 0.5820, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.2500, -4.6875, -0.9141, 2.2344, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
[2025-11-06 19:05:13,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.86 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 19:05:13,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.94 | bwd_microstep: 34.03 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 33.18 | step_microstep: 2.82
[2025-11-06 19:05:13,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.15 | bwd: 34.75 | bwd_inner: 1.40 | bwd_allreduce: 33.21 | step: 2.91
92%|█████████▏| 3240/3507 [1:20:27<07:01, 1.58s/it] {'loss': 0.2927, 'learning_rate': 3.026052267678981e-07, 'epoch': 0.92}
tensor([[-6.3750, -3.2344, 2.5625, 1.2812, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[0.6172, 1.7031, 4.6250, 6.1875, 1.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:13,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.10 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-2.9375, 1.2969, 3.4688, -2.0000, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-8.1250, -4.8750, 0.1836, -1.7422, -6.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.5000, -2.9375, 0.8438, 1.6328, -2.8281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-6.1562, -3.2812, 2.1250, 0.9922, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-4.8125, -2.6094, 1.4141, 1.1797, -3.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-2.5156, -3.4219, -1.8828, 2.7344, 0.0302]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 19:05:16,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.85 | optimizer_gradients: 0.22 | optimizer_step: 0.21
[2025-11-06 19:05:16,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 290.35 | bwd_microstep: 2038.83 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 2037.77 | step_microstep: 2.87
[2025-11-06 19:05:16,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 493.49 | bwd: 2039.72 | bwd_inner: 1.72 | bwd_allreduce: 2037.83 | step: 2.98
92%|█████████▏| 3241/3507 [1:20:30<08:20, 1.88s/it] {'loss': 0.7574, 'learning_rate': 3.0035416026001573e-07, 'epoch': 0.92}
tensor([[-3.1719, -1.1406, -0.4180, -1.3672, -2.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:2')
tensor([[-4.2812, -3.5156, -0.0903, 2.1562, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.5781, -0.7500, 2.9062, 1.3516, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-3.0000, -3.0000, -1.5625, 0.8945, -1.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:16,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.01 | bwd_microstep: 2.58 | bwd_inner_microstep: 2.45 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.8125, -2.5781, 1.8828, 1.8281, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.6875, -3.0938, 1.2656, 2.4531, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.9219, -3.6406, -0.9922, 1.8750, -1.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.4375, -3.5000, 0.5312, 2.8594, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:18,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.17 | optimizer_step: 0.18
[2025-11-06 19:05:18,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.35 | bwd_microstep: 1.78 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.44
[2025-11-06 19:05:18,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 291.37 | bwd: 4.36 | bwd_inner: 3.36 | bwd_allreduce: 0.84 | step: 2.54
92%|█████████▏| 3242/3507 [1:20:31<08:06, 1.83s/it] {'loss': 0.3939, 'learning_rate': 2.9811137016844347e-07, 'epoch': 0.92}
tensor([[-5.2500, -4.0625, -0.0074, 2.1250, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-3.8281, -3.1094, -0.3320, 1.6875, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.9688, -3.0781, 1.5156, 2.2656, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:05:18,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.79 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.3125, -5.3125, -1.5391, 2.3438, -2.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-2.9844, -3.4844, -0.7695, 3.7656, -0.3535]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.3125, -3.2188, -2.3594, 1.6016, -0.0130]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-4.8438, -2.3125, 1.0312, -0.3867, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.5312, -4.1875, -0.8750, 2.2656, -2.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 19:05:19,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.79 | optimizer_gradients: 0.18 | optimizer_step: 0.21
[2025-11-06 19:05:19,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.02 | bwd_microstep: 982.34 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 981.18 | step_microstep: 2.70
[2025-11-06 19:05:19,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.82 | bwd: 983.26 | bwd_inner: 1.86 | bwd_allreduce: 981.23 | step: 2.79
92%|█████████▏| 3243/3507 [1:20:33<07:28, 1.70s/it] {'loss': 0.4283, 'learning_rate': 2.9587685840688716e-07, 'epoch': 0.92}
tensor([[-3.3906, 0.5938, 3.5938, -1.1016, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-5.1875, -1.9922, 2.4062, 0.2295, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.0625, -2.7188, 0.7383, 1.6172, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-0.3418, -1.2734, -0.3164, 4.0312, 1.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:3')
tensor([[-2.5938, 1.5625, 3.3906, -2.1406, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-4.3750, -2.7812, 1.1172, 1.9297, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-2.7969, -3.3125, -0.9180, 3.3438, -0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:21,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.77 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11
tensor([[-4.0938, -3.9062, -0.7578, 2.7031, -1.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:21,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.14 | optimizer_step: 0.18
[2025-11-06 19:05:21,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.65 | bwd_microstep: 2.19 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 0.93 | step_microstep: 2.25
[2025-11-06 19:05:21,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.44 | bwd: 3.21 | bwd_inner: 2.07 | bwd_allreduce: 0.99 | step: 2.35
93%|█████████▎| 3244/3507 [1:20:35<08:33, 1.95s/it] {'loss': 0.7296, 'learning_rate': 2.936506268819894e-07, 'epoch': 0.93}
tensor([[-2.2500, -3.3281, -1.7031, 3.1406, 0.3496]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:22,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.65 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-7.1562, -5.1250, 0.0569, 0.7188, -4.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-7.0938, -5.0000, 1.1016, 1.9844, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.5000, -3.6562, 0.1553, -1.4453, -5.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:1')
tensor([[-5.7812, -4.2500, 0.8125, 2.7188, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-6.7188, -4.5312, 0.6523, 1.0547, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-5.4062, -3.5000, 0.6406, 1.2188, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.6562, -4.0312, -0.5117, 2.2656, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
[2025-11-06 19:05:22,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 19:05:22,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 266.29 | bwd_microstep: 538.29 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 537.13 | step_microstep: 1.93
[2025-11-06 19:05:22,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 411.97 | bwd: 539.27 | bwd_inner: 1.94 | bwd_allreduce: 537.18 | step: 2.01
93%|█████████▎| 3245/3507 [1:20:36<07:16, 1.66s/it] {'loss': 0.3296, 'learning_rate': 2.9143267749332626e-07, 'epoch': 0.93}
tensor([[-5.3750, -1.0078, 2.3125, -2.7188, -5.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-0.0972, -1.2969, -1.0156, 3.2500, 1.9141]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:1')
tensor([[-4.3750, -2.7969, 0.9297, 1.8359, -2.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:23,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.95 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.8750, -5.2812, -0.1533, 3.5625, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.5312, -3.9688, -0.0859, 2.8750, -2.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.4531, -4.0000, -1.6719, 2.3750, -0.9414]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-3.3750, -3.9688, -1.5312, 2.9844, -0.6953]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:3')
tensor([[-6.2500, -2.9375, 2.2812, 0.2637, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
[2025-11-06 19:05:24,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 19:05:24,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.38 | bwd_microstep: 2.06 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.90 | step_microstep: 1.91
[2025-11-06 19:05:24,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.33 | bwd: 2.71 | bwd_inner: 1.65 | bwd_allreduce: 0.93 | step: 1.99
93%|█████████▎| 3246/3507 [1:20:38<06:43, 1.55s/it] {'loss': 0.7506, 'learning_rate': 2.892230121334083e-07, 'epoch': 0.93}
tensor([[-5.1875, -2.1562, 1.1641, -0.9727, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:3')
tensor([[-2.7500, -0.7344, 1.4062, 0.6484, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.7188, -4.6250, -0.5273, 3.5312, -1.8047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
[2025-11-06 19:05:24,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.34 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-0.8047, 2.9219, 3.7344, -0.8203, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:3')
tensor([[-4.6562, -4.5625, -0.9883, 2.4688, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-7.0000, -6.7500, -2.1719, 1.9688, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:0')
tensor([[-5.0312, -1.2188, 1.4766, -2.5938, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:2')
tensor([[-3.2500, -4.2812, -3.6406, 0.4355, -0.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([4], device='cuda:2')
[2025-11-06 19:05:27,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.77 | optimizer_gradients: 0.18 | optimizer_step: 0.20
[2025-11-06 19:05:27,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.75 | bwd_microstep: 2497.26 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 2495.91 | step_microstep: 2.71
[2025-11-06 19:05:27,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 380.12 | bwd: 2498.25 | bwd_inner: 2.16 | bwd_allreduce: 2495.95 | step: 2.79
93%|█████████▎| 3247/3507 [1:20:40<08:28, 1.96s/it] {'loss': 0.5201, 'learning_rate': 2.8702163268767294e-07, 'epoch': 0.93}
tensor([[-7.5625, -6.0938, 0.0645, 2.6875, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-4.4688, -4.3438, -0.9883, 2.5781, -1.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-4.4062, -0.8516, 1.6719, -1.8750, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([1], device='cuda:0')
[2025-11-06 19:05:27,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.57 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.4062, -2.8281, 1.4766, 2.6719, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:3')
tensor([[-4.6562, -4.1562, -0.4043, 2.6250, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:1')
tensor([[-3.8594, -4.0625, -1.5469, 1.9766, -1.4922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=<...>) tensor([3], device='cuda:2')
tensor([[-5.6562, -1.4688, 3.6875, -0.5391, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>) tensor([2], device='cuda:0')
tensor([[-6.5625, -4.2500, 1.4922, 1.8828, -4.4688]], device='cuda:3', dtype=torch.bfloat16,
grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:05:27,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:05:27,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.04 | bwd_microstep: 2.79 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 1.59 | step_microstep: 1.97 [2025-11-06 19:05:27,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 461.64 | bwd: 3.71 | bwd_inner: 1.96 | bwd_allreduce: 1.63 | step: 2.06 93%|█████████▎| 3248/3507 [1:20:41<06:35, 1.53s/it] {'loss': 0.4586, 'learning_rate': 2.848285410344953e-07, 'epoch': 0.93} 93%|█████████▎| 3248/3507 [1:20:41<06:35, 1.53s/it]tensor([[-1.7812, 1.7031, 2.8750, -1.1875, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.5625, -4.3750, -0.0223, 2.1094, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:05:27,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.31 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.2188, -5.0625, -1.1875, 2.8125, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9062, -5.4688, -1.9219, 3.2812, -1.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9062, -2.2344, 1.4922, 0.1387, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9375, -1.6719, 2.6875, -0.0139, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.6562, -2.3125, 2.3750, -2.2812, -6.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.7656, 1.6641, 1.8438, -2.1875, -2.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') [2025-11-06 19:05:30,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 19:05:30,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 335.07 | bwd_microstep: 2221.60 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 2220.42 | step_microstep: 2.07 [2025-11-06 19:05:30,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 522.41 | bwd: 2222.42 | bwd_inner: 1.83 | bwd_allreduce: 2220.46 | step: 2.14 93%|█████████▎| 3249/3507 [1:20:44<08:11, 1.91s/it] {'loss': 0.1796, 'learning_rate': 2.8264373904517307e-07, 'epoch': 0.93} 93%|█████████▎| 3249/3507 [1:20:44<08:11, 1.91s/it]tensor([[-5.5625, -4.9375, -0.9922, 1.9844, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4375, -4.5938, -0.6992, 1.7266, -3.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5312, 0.4180, 3.4688, -0.7617, -3.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:05:30,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.33 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0938, -3.1875, 0.7930, 0.8359, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1562, -4.7500, 0.6758, 2.5781, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.1875, -3.9375, -0.1992, -0.6836, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8750, -4.8750, -1.3594, 2.9375, -1.8828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.2188, -6.1875, -1.8750, 2.6719, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
[2025-11-06 19:05:30,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:05:30,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.59 | bwd_microstep: 41.04 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 40.04 | step_microstep: 1.46 [2025-11-06 19:05:30,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.95 | bwd: 42.05 | bwd_inner: 1.85 | bwd_allreduce: 40.08 | step: 1.55 93%|█████████▎| 3250/3507 [1:20:44<06:15, 1.46s/it] {'loss': 0.264, 'learning_rate': 2.804672285839316e-07, 'epoch': 0.93} 93%|█████████▎| 3250/3507 [1:20:44<06:15, 1.46s/it]tensor([[-5.0625, -3.6094, 0.7539, 2.3125, -2.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.2812, -4.5938, 1.9141, 1.6016, -5.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9375, -1.5547, 2.3125, -0.2432, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:05:31,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.14 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.8750, -5.4688, -1.8438, 1.4297, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8594, -1.8047, 1.1797, 0.5938, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.8086, 2.2500, 2.5156, -0.6211, -1.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[0.1084, 0.6445, 1.9531, 3.2031, 0.8086]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.3750, 1.2109, 3.4375, -2.6875, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:05:31,564] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:05:31,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 284.61 | bwd_microstep: 135.61 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 134.82 | step_microstep: 2.11 [2025-11-06 19:05:31,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 481.78 | bwd: 136.59 | bwd_inner: 1.60 | bwd_allreduce: 134.85 | step: 2.18 93%|█████████▎| 3251/3507 [1:20:45<05:12, 1.22s/it] {'loss': 0.5836, 'learning_rate': 2.7829901150792205e-07, 'epoch': 0.93} 93%|█████████▎| 3251/3507 [1:20:45<05:12, 1.22s/it]tensor([[-2.6250, -3.5156, -1.1719, 3.7969, 0.0669]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9375, -3.6875, -1.4844, 3.1562, -0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:31,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.87 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.1250, -5.4062, -0.9023, 2.2188, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1250, -2.9375, 0.8008, 0.3223, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.7188, -4.7188, 1.0078, -0.1797, -5.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.1875, -5.0625, 0.6562, 1.2891, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4375, -4.5938, -1.2578, 2.9219, -1.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5000, -3.5469, 1.3359, 1.7500, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:05:34,062] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.18 | optimizer_step: 0.29 [2025-11-06 19:05:34,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.09 | bwd_microstep: 3.27 | bwd_inner_microstep: 2.14 | bwd_allreduce_microstep: 1.03 | step_microstep: 3.29 [2025-11-06 19:05:34,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 355.99 | bwd: 3.93 | bwd_inner: 2.71 | bwd_allreduce: 1.07 | step: 3.37 93%|█████████▎| 3252/3507 [1:20:47<06:49, 1.61s/it] {'loss': 0.2848, 'learning_rate': 2.7613908966722004e-07, 'epoch': 0.93} 93%|█████████▎| 3252/3507 [1:20:47<06:49, 1.61s/it]tensor([[-2.9688, -3.4688, -1.7422, 2.4062, -0.4629]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2812, -3.3594, 0.4785, 0.7695, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:34,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.38 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.1562, -4.2812, 0.7305, 1.5859, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7656, -3.3594, -0.2070, 2.4688, -1.6484]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.0938, -3.8281, -1.7344, -4.5938, -6.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8125, -2.0000, 1.9609, 4.7188, -0.7734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.0000, -2.5000, 1.8125, 1.0547, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7812, -4.5000, 0.4824, 2.4688, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:05:36,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.17 | optimizer_gradients: 0.22 | optimizer_step: 0.26 [2025-11-06 19:05:36,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.14 | bwd_microstep: 1491.40 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 1490.30 | step_microstep: 2.52 [2025-11-06 19:05:36,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.55 | bwd: 1492.22 | bwd_inner: 1.72 | bwd_allreduce: 1490.35 | step: 2.60 93%|█████████▎| 3253/3507 [1:20:49<07:13, 1.71s/it] {'loss': 0.483, 'learning_rate': 2.739874649048202e-07, 'epoch': 0.93} 93%|█████████▎| 3253/3507 [1:20:49<07:13, 1.71s/it]tensor([[-3.4062, 0.0918, 2.8438, -0.2119, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, 0.1021, 4.0938, -1.4844, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.6875, -4.7188, 0.0588, 0.6602, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:36,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.70 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.8125, -3.5781, 2.1875, 0.2373, -5.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2812, -2.8125, 1.2656, 0.6328, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5000, -4.8125, -1.3203, 1.2656, -3.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.4062, -4.9688, 1.5469, 1.9688, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6250, -4.8750, -2.1406, 1.5703, -1.9766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:36,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | 
optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 19:05:36,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.50 | bwd_microstep: 3.16 | bwd_inner_microstep: 2.26 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.15 [2025-11-06 19:05:36,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 396.23 | bwd: 3.83 | bwd_inner: 2.84 | bwd_allreduce: 0.86 | step: 2.23 93%|█████████▎| 3254/3507 [1:20:50<06:11, 1.47s/it] {'loss': 0.2777, 'learning_rate': 2.7184413905664063e-07, 'epoch': 0.93} 93%|█████████▎| 3254/3507 [1:20:50<06:11, 1.47s/it]tensor([[-3.4219, -0.8555, 2.0938, 0.7305, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.0312, -6.0312, -0.5156, 2.4219, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:37,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 213.49 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.8906, -1.2266, 2.6875, 1.5781, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7188, -4.9375, -1.1641, 3.3594, -1.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.2158, 3.3125, 2.6719, -1.9375, -1.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-6.7500, -4.4688, 1.5625, 1.9062, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4375, -4.6562, -0.5977, 2.2656, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.9062, -0.7227, 2.9375, 0.5078, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:05:40,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.31 | optimizer_gradients: 0.20 | 
optimizer_step: 0.17 [2025-11-06 19:05:40,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.79 | bwd_microstep: 2608.44 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 2607.60 | step_microstep: 3.90 [2025-11-06 19:05:40,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 415.31 | bwd: 2609.11 | bwd_inner: 1.32 | bwd_allreduce: 2607.65 | step: 3.97 93%|█████████▎| 3255/3507 [1:20:53<08:15, 1.97s/it] {'loss': 0.2268, 'learning_rate': 2.697091139515151e-07, 'epoch': 0.93} 93%|█████████▎| 3255/3507 [1:20:53<08:15, 1.97s/it]tensor([[-4.6250, -1.1328, 2.1875, -0.7852, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.8750, -3.7188, 2.0469, 0.5273, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.0625, -8.0625, -4.0625, 0.2451, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:40,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.41 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.2188, -4.4688, 0.4648, 1.8281, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.2188, -4.8125, 1.4375, 2.0312, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3906, -1.3984, 1.0703, 0.5312, -2.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.9375, -1.6016, 2.0469, -0.8242, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -4.6875, -1.9141, 2.3906, -1.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:05:41,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.22 | optimizer_step: 0.18 
[2025-11-06 19:05:41,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.51 | bwd_microstep: 559.19 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 558.30 | step_microstep: 1.87 [2025-11-06 19:05:41,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.94 | bwd: 559.88 | bwd_inner: 1.40 | bwd_allreduce: 558.34 | step: 1.95 93%|█████████▎| 3256/3507 [1:20:55<07:16, 1.74s/it] {'loss': 0.3943, 'learning_rate': 2.6758239141119745e-07, 'epoch': 0.93} 93%|█████████▎| 3256/3507 [1:20:55<07:16, 1.74s/it]tensor([[-3.4062, -1.6641, 2.5000, 3.4531, -1.8516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:41,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.47 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.4062, 0.8672, 2.8281, -0.8555, -2.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.6875, -4.5938, -0.3828, 3.9375, -1.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.0000, -2.6875, 2.7188, -1.3906, -6.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.8594, -4.3125, -1.2656, 3.4375, -0.9727]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -4.4062, -0.8594, 2.4844, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.1670, 2.9688, 2.0469, -1.8516, -1.2578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.0625, -4.9375, -0.6523, 3.6250, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:05:44,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.25 | optimizer_step: 0.36 [2025-11-06 19:05:44,163] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.69 | bwd_microstep: 2575.19 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 2573.95 | step_microstep: 3.10 [2025-11-06 19:05:44,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 302.19 | bwd: 2575.94 | bwd_inner: 1.80 | bwd_allreduce: 2574.00 | step: 3.17 93%|█████████▎| 3257/3507 [1:20:57<08:42, 2.09s/it] {'loss': 0.3703, 'learning_rate': 2.6546397325035833e-07, 'epoch': 0.93} 93%|█████████▎| 3257/3507 [1:20:57<08:42, 2.09s/it]tensor([[-4.1250, -4.3750, -0.7305, 3.6719, -1.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:44,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.58 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-5.8125, -3.2188, 0.8555, -0.1270, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3438, -3.9062, 0.1167, 3.5938, -1.7656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0469, 1.1953, 2.4531, -0.9492, -2.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.1875, -2.6719, 2.1719, -0.2480, -5.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0000, -4.0312, 2.0312, 1.1328, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[4.4688, 5.9688, 6.9688, 6.6562, 3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -3.2188, 1.5859, 3.4062, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:05:44,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.18 | optimizer_gradients: 0.16 | optimizer_step: 0.15 [2025-11-06 19:05:44,945] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.78 | bwd_microstep: 432.59 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 431.45 | step_microstep: 3.19 [2025-11-06 19:05:44,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 310.38 | bwd: 433.61 | bwd_inner: 1.95 | bwd_allreduce: 431.50 | step: 3.28 93%|█████████▎| 3258/3507 [1:20:58<07:02, 1.70s/it] {'loss': 0.6099, 'learning_rate': 2.6335386127657734e-07, 'epoch': 0.93} 93%|█████████▎| 3258/3507 [1:20:58<07:02, 1.70s/it]tensor([[-6.4062, -5.9688, -1.6484, 1.6719, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:45,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 147.14 | bwd_microstep: 1.77 | bwd_inner_microstep: 1.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.2188, -3.9375, 0.1904, 1.7891, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.5312, 0.0371, 3.2500, 3.9531, -0.4961]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8125, -1.7188, 3.6406, -0.2354, -5.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0938, -4.6562, 0.2441, 2.1094, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.1875, -1.7812, 2.0938, -0.6562, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.8203, 2.4062, 3.9219, -1.6953, -2.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -3.9844, -0.1211, 2.1875, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:05:45,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 19:05:45,998] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 156.46 | bwd_microstep: 708.41 | bwd_inner_microstep: 2.95 | bwd_allreduce_microstep: 705.24 | step_microstep: 3.17 [2025-11-06 19:05:45,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.68 | bwd: 710.14 | bwd_inner: 4.63 | bwd_allreduce: 705.25 | step: 3.24 93%|█████████▎| 3259/3507 [1:20:59<06:12, 1.50s/it] {'loss': 0.3529, 'learning_rate': 2.6125205729035097e-07, 'epoch': 0.93} 93%|█████████▎| 3259/3507 [1:20:59<06:12, 1.50s/it]tensor([[-4.4062, -3.4531, 0.3496, 2.2812, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9609, -2.6406, -1.7891, 1.6406, 0.0684]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:46,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.11 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.18 tensor([[-1.5078, 2.2500, 2.8438, -2.3125, -2.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.7188, -7.0000, -1.5859, 2.2031, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.2969, -0.0035, 2.5625, -0.3555, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9375, -5.4062, -1.1562, 2.0156, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-0.7422, 2.0156, 2.0156, -1.1562, -1.3984]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.0625, -4.8438, -0.4707, 3.6562, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:05:47,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.23 | optimizer_step: 0.19 [2025-11-06 19:05:47,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 127.03 | bwd_microstep: 1246.19 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1244.97 | step_microstep: 222.05 [2025-11-06 19:05:47,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 286.15 | bwd: 1247.10 | bwd_inner: 1.96 | bwd_allreduce: 1245.01 | step: 222.23 93%|█████████▎| 3260/3507 [1:21:01<06:32, 1.59s/it] {'loss': 0.2289, 'learning_rate': 2.591585630850835e-07, 'epoch': 0.93} 93%|█████████▎| 3260/3507 [1:21:01<06:32, 1.59s/it]tensor([[-4.1250, -4.5312, -1.9844, 2.1250, -1.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -4.2500, -0.8164, 1.6875, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:47,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.91 | bwd_microstep: 2.14 | bwd_inner_microstep: 1.81 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.21 tensor([[-5.0938, -1.7188, 2.4219, -0.2051, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3438, -4.5625, -1.5000, 2.5469, -1.6172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6875, 0.0168, 2.1875, -0.1133, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.7188, -1.9141, 1.6953, 0.1235, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2812, -2.0312, 1.9375, -0.5469, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8906, -2.0312, 1.3438, 1.5703, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:05:48,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 19:05:48,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
192.39 | bwd_microstep: 110.16 | bwd_inner_microstep: 1.71 | bwd_allreduce_microstep: 108.29 | step_microstep: 2.01 [2025-11-06 19:05:48,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.37 | bwd: 112.29 | bwd_inner: 3.55 | bwd_allreduce: 108.41 | step: 2.23 93%|█████████▎| 3261/3507 [1:21:02<05:10, 1.26s/it] {'loss': 0.4459, 'learning_rate': 2.570733804470926e-07, 'epoch': 0.93} 93%|█████████▎| 3261/3507 [1:21:02<05:10, 1.26s/it]tensor([[-6.4062, -3.9844, 0.9375, 0.6406, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.6562, -4.0312, 1.7422, -1.0938, -6.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:05:48,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.69 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.9375, -1.9219, 2.0312, 4.0000, -1.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.7344, 1.0000, 2.7500, 0.9844, -1.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0938, -4.7500, 1.2656, 1.5938, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6406, -3.2500, -0.4688, 2.4688, -1.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.7656, -0.4160, 2.8906, 0.0139, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.1250, -6.5312, -1.7188, 1.8750, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:05:50,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 19:05:50,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.01 | bwd_microstep: 
1734.98 | bwd_inner_microstep: 8.12 | bwd_allreduce_microstep: 1726.77 | step_microstep: 1.86 [2025-11-06 19:05:50,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.71 | bwd: 1735.69 | bwd_inner: 8.73 | bwd_allreduce: 1726.81 | step: 1.94 93%|█████████▎| 3262/3507 [1:21:04<06:25, 1.58s/it] {'loss': 0.2682, 'learning_rate': 2.5499651115560296e-07, 'epoch': 0.93} 93%|█████████▎| 3262/3507 [1:21:04<06:25, 1.58s/it]tensor([[-6.8125, -5.3438, -0.6914, 1.0312, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, -0.9219, 2.3750, 0.3281, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5000, -2.3906, 1.0312, -1.2812, -4.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:05:50,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.45 | bwd_microstep: 2.92 | bwd_inner_microstep: 2.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-3.4688, 0.5547, 3.3281, -1.7109, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') [h264 @ 0xc58f8c0] mmco: unref short failure tensor([[-4.7500, -4.7188, -0.9414, 3.2188, -1.8359]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2812, -3.0625, 0.5547, 2.2656, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.6875, 2.5938, 4.2188, -1.5703, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6562, -2.1406, 1.9375, 0.7227, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:05:51,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 19:05:51,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
190.92 | bwd_microstep: 294.41 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 293.53 | step_microstep: 2.00
[2025-11-06 19:05:51,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 377.41 | bwd: 297.33 | bwd_inner: 3.60 | bwd_allreduce: 293.58 | step: 2.09
93%|█████████▎| 3263/3507 [1:21:05<05:22, 1.32s/it] {'loss': 0.9663, 'learning_rate': 2.529279569827414e-07, 'epoch': 0.93}
93%|█████████▎| 3263/3507 [1:21:05<05:22, 1.32s/it]
[19:05:51] /github/workspace/src/video/video_reader.cc:83: ERROR opening: /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch21/Village_of_Romeoville_Ribbon_Cutting_-_videos_Hair_Care_April_14_2014.mp4, No such file or directory
Warning: The cache directory for DeepSpeed Triton autotune, /root/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
Error reading /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch21/Village_of_Romeoville_Ribbon_Cutting_-_videos_Hair_Care_April_14_2014.mp4...
sharegpt4v_instruct_gpt4-vision_cap100k
Traceback (most recent call last):
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 718, in __getitem__
    ret=self.video_get_item(data_item)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 604, in video_get_item
    image_list,frame_indices = self.load_video(video_path)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 582, in load_video
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/miniconda3/envs/visualquality/lib/python3.11/site-packages/decord/video_reader.py", line 57, in __init__
    raise RuntimeError("Error reading " + uri + "...")
RuntimeError: Error reading /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch21/Village_of_Romeoville_Ribbon_Cutting_-_videos_Hair_Care_April_14_2014.mp4... 
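The traceback above shows the dataset's `__getitem__` propagating a `RuntimeError` from `decord.VideoReader` when a single video path is unreadable, which can take down the whole distributed run. A minimal defensive sketch of the usual workaround — `load_video` here is a hypothetical stand-in for the decord-based loader in `internvl_chat_finetune_dist.py`, and the fall-back-to-the-next-sample policy is an assumption, not the repo's actual behavior:

```python
def load_video(path):
    # Hypothetical stand-in for the decord-based loader: raises
    # RuntimeError for unreadable files, as decord.VideoReader does above.
    if path.endswith("missing.mp4"):
        raise RuntimeError("Error reading " + path + "...")
    return ["frame0", "frame1"]

def safe_getitem(video_paths, index, max_retries=10):
    # Try the requested sample; on a read failure, deterministically fall
    # back to the next sample instead of crashing the training run.
    for _ in range(max_retries):
        try:
            return load_video(video_paths[index])
        except (RuntimeError, FileNotFoundError):
            index = (index + 1) % len(video_paths)
    raise RuntimeError("too many consecutive unreadable samples")
```

Swapping a bad sample for a neighbor keeps batch shapes intact at the cost of a slightly skewed sampling distribution, which is usually acceptable for a handful of corrupt files.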
tensor([[-4.6250, -4.0625, -0.7734, 1.9844, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -3.9375, 0.0547, 2.3906, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8438, -4.2500, 0.1050, 1.2109, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:51,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.31 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 tensor([[-4.4062, -4.3750, -0.7500, 3.3281, -1.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0312, -4.2500, -0.6602, 1.5703, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -3.9219, 1.3516, 3.6406, -2.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.5938, -2.8906, 2.0469, -1.0547, -5.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.7109, 0.5156, 2.1250, 0.8320, -1.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:05:54,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.25 | optimizer_step: 0.25 [2025-11-06 19:05:54,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.45 | bwd_microstep: 963.58 | bwd_inner_microstep: 2.67 | bwd_allreduce_microstep: 960.77 | step_microstep: 2.64 [2025-11-06 19:05:54,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.79 | bwd: 964.61 | bwd_inner: 3.56 | bwd_allreduce: 960.85 | step: 2.74 93%|█████████▎| 3264/3507 [1:21:08<07:27, 1.84s/it] {'loss': 0.1486, 'learning_rate': 2.5086771969354497e-07, 'epoch': 0.93} 93%|█████████▎| 3264/3507 [1:21:08<07:27, 1.84s/it]tensor([[-4.2812, -1.5078, 
1.8516, -0.2539, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.4375, -2.2031, 2.5156, -1.8047, -6.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1875, -4.8750, -0.2051, 1.8672, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5000, -4.8125, -0.4941, 2.4688, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:54,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.54 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.8750, -4.8438, 0.0449, 2.6875, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7969, -0.0569, 1.8750, -0.0123, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.7188, -2.6250, -0.6484, 2.3594, -0.6680]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3438, -0.4102, 3.2812, -0.9062, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:05:54,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.40 | optimizer_step: 0.44 [2025-11-06 19:05:54,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 320.17 | bwd_microstep: 2.58 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1.23 | step_microstep: 11.45 [2025-11-06 19:05:54,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 525.74 | bwd: 3.52 | bwd_inner: 2.06 | bwd_allreduce: 1.29 | step: 11.52 93%|█████████▎| 3265/3507 [1:21:08<05:54, 1.47s/it] {'loss': 0.1055, 'learning_rate': 2.4881580104595296e-07, 'epoch': 0.93} 93%|█████████▎| 3265/3507 [1:21:08<05:54, 1.47s/it]tensor([[-2.8125, -3.4531, -1.3516, 2.9062, -0.3105]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0312, -3.4219, 0.3086, 0.8711, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.3438, 0.3574, 2.0156, -0.6445, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.8125, -4.4375, 1.1406, 1.1172, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6250, -5.8125, -2.7812, 1.0781, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9062, -3.5781, 0.6484, 2.5156, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9531, 1.0859, 3.4219, -1.5625, -3.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:05:59,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.49 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.5312, -4.0625, -0.2393, 0.5234, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:05:59,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.64 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 19:05:59,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 360.56 | bwd_microstep: 1.71 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.42 [2025-11-06 19:05:59,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 544.08 | bwd: 2.50 | bwd_inner: 1.44 | bwd_allreduce: 0.92 | step: 2.50 93%|█████████▎| 3266/3507 [1:21:13<09:29, 2.36s/it] {'loss': 0.2701, 'learning_rate': 2.467722027908048e-07, 'epoch': 0.93} 93%|█████████▎| 3266/3507 [1:21:13<09:29, 2.36s/it]tensor([[-4.6562, -4.3750, -0.6172, 2.6250, -2.0938]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')tensor([[-3.3750, -4.1875, -2.4688, 1.8828, -0.7266]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:05:59,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.33 | bwd_microstep: 0.61 | bwd_inner_microstep: 0.51 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.3125, -4.4688, -0.3418, 2.2188, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0312, -4.4375, -0.2578, 3.0625, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.0781, -2.0312, 1.8906, 3.7969, -1.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5938, -4.2188, 1.6875, 2.2344, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6875, -1.9219, 1.7969, 2.4688, -2.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2188, -3.2969, 0.5508, 2.7188, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:05:59,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 19:05:59,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.72 | bwd_microstep: 67.98 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 67.11 | step_microstep: 1.96 [2025-11-06 19:05:59,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.08 | bwd: 68.59 | bwd_inner: 1.30 | bwd_allreduce: 67.15 | step: 2.04 93%|█████████▎| 3267/3507 [1:21:13<07:08, 1.79s/it] {'loss': 0.4828, 'learning_rate': 2.4473692667184136e-07, 'epoch': 0.93} 93%|█████████▎| 3267/3507 [1:21:13<07:08, 1.79s/it]tensor([[-4.0312, -1.0391, 1.3125, -0.9531, -3.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:1') tensor([[-3.5312, -3.6406, -0.9570, 2.6719, -1.1484]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8906, -0.5078, 1.8047, -1.3828, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3438, -4.7500, -0.3789, 2.7031, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -3.5781, 1.0859, 3.0781, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:06:00,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.73 | bwd_microstep: 1.96 | bwd_inner_microstep: 1.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.1562, -3.0781, 0.4453, 2.4844, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.7188, -2.9375, 1.5078, 0.3730, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.6562, -5.2812, -0.2090, 1.4688, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:06:01,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 19:06:01,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.31 | bwd_microstep: 819.14 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 818.24 | step_microstep: 2.33 [2025-11-06 19:06:01,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.06 | bwd: 821.10 | bwd_inner: 2.66 | bwd_allreduce: 818.29 | step: 2.41 93%|█████████▎| 3268/3507 [1:21:15<07:30, 1.89s/it] {'loss': 0.4001, 'learning_rate': 2.4270997442570335e-07, 'epoch': 0.93} 93%|█████████▎| 3268/3507 [1:21:15<07:30, 1.89s/it]tensor([[-4.4062, -1.9766, 2.0781, 1.4609, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-2.7344, 0.0364, 1.5312, -0.4668, -2.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.3750, -3.7656, -0.8984, 3.3438, -0.7930]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:06:02,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.20 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11 tensor([[-5.2500, -4.3438, -0.0977, 2.3750, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2188, -4.4688, -1.9766, 1.9453, -1.5859]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0000, -1.8359, 2.4688, 0.6602, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8438, -2.4531, 1.1328, -0.1196, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.0000, -2.4844, 1.7969, 0.8906, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:06:02,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.19 | optimizer_step: 0.17 [2025-11-06 19:06:02,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.53 | bwd_microstep: 326.51 | bwd_inner_microstep: 5.31 | bwd_allreduce_microstep: 321.11 | step_microstep: 2.17 [2025-11-06 19:06:02,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 444.74 | bwd: 327.27 | bwd_inner: 5.95 | bwd_allreduce: 321.17 | step: 2.29 93%|█████████▎| 3269/3507 [1:21:16<06:12, 1.57s/it] {'loss': 0.2183, 'learning_rate': 2.406913477819273e-07, 'epoch': 0.93} 93%|█████████▎| 3269/3507 [1:21:16<06:12, 1.57s/it]tensor([[-2.3281, 1.4375, 3.0781, -1.3906, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') 
tensor([[-4.7812, -3.7188, 0.4492, 2.6719, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5938, -4.8750, -0.2080, 3.1562, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1875, -4.1562, 0.2266, 2.4219, -2.8594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5625, 1.2188, 2.8438, -1.3828, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.1875, -2.8594, 2.1406, -0.0737, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5625, -2.6094, -0.5234, 2.5000, -0.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:06:05,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.71 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-6.5625, -4.0938, 1.7500, 1.6875, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:06:05,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 19:06:05,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.92 | bwd_microstep: 1.73 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 0.73 | step_microstep: 2.44 [2025-11-06 19:06:05,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 488.64 | bwd: 2.66 | bwd_inner: 1.74 | bwd_allreduce: 0.78 | step: 2.53 93%|█████████▎| 3270/3507 [1:21:19<08:00, 2.03s/it] {'loss': 0.6708, 'learning_rate': 2.386810484629476e-07, 'epoch': 0.93} 93%|█████████▎| 3270/3507 [1:21:19<08:00, 2.03s/it]tensor([[-4.5312, -0.1973, 3.0625, -2.2656, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7188, -4.6250, -2.6406, 
2.1562, -0.8477]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:06:06,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.35 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-7.2812, -6.6250, -1.8047, 1.6484, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -5.3125, -1.3672, 3.1250, -1.9922]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.9688, -2.0000, 2.9375, -0.7344, -5.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.9062, -4.9688, 0.2598, 1.2188, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5000, -2.1875, 2.3125, -0.0334, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3594, 1.2578, 3.9531, -2.2344, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:06:06,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.29 | optimizer_step: 0.19 [2025-11-06 19:06:06,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.94 | bwd_microstep: 47.13 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 45.92 | step_microstep: 2.29 [2025-11-06 19:06:06,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.34 | bwd: 47.96 | bwd_inner: 1.80 | bwd_allreduce: 45.99 | step: 2.38 93%|█████████▎| 3271/3507 [1:21:20<06:07, 1.56s/it] {'loss': 0.0859, 'learning_rate': 2.3667907818409109e-07, 'epoch': 0.93} 93%|█████████▎| 3271/3507 [1:21:20<06:07, 1.56s/it]tensor([[-4.5312, -0.4277, 3.1562, -1.3047, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1250, -4.4062, -0.3340, 2.2656, -2.6562]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.0625, -5.1250, 0.8125, 1.8672, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6875, -4.0938, -1.8750, 1.8906, -1.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.0469, 1.1328, 5.2500, 0.5586, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0000, -0.0111, 2.4844, -2.4375, -4.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.0625, 1.7344, 4.1250, -2.5000, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 19:06:08,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 317.02 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-5.5625, -2.3906, 3.1562, 1.3984, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:06:08,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:06:08,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.85 | bwd_microstep: 1.98 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.95 | step_microstep: 2.05 [2025-11-06 19:06:08,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 564.91 | bwd: 2.80 | bwd_inner: 1.62 | bwd_allreduce: 1.00 | step: 2.15 93%|█████████▎| 3272/3507 [1:21:22<06:45, 1.72s/it] {'loss': 1.0439, 'learning_rate': 2.3468543865358017e-07, 'epoch': 0.93} 93%|█████████▎| 3272/3507 [1:21:22<06:45, 1.72s/it]tensor([[-4.8750, -4.2188, -0.4004, 2.1719, -2.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1875, -4.6250, 0.0065, 3.6250, -2.3594]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.0000, 1.3047, 2.6875, -0.7891, -2.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 19:06:08,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.92 | bwd_microstep: 0.63 | bwd_inner_microstep: 0.53 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.6875, -3.2812, 2.7031, 0.6602, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.1250, 1.9219, 3.1406, -2.1875, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -0.8789, 3.7969, -0.1602, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6094, -2.5156, 1.7578, 4.1250, -1.5234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.1875, -5.1562, -0.5664, 1.7891, -3.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:06:08,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.60 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 19:06:08,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.55 | bwd_microstep: 77.16 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 76.34 | step_microstep: 1.96 [2025-11-06 19:06:08,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 403.50 | bwd: 77.79 | bwd_inner: 1.29 | bwd_allreduce: 76.38 | step: 2.04 93%|█████████▎| 3273/3507 [1:21:22<05:19, 1.36s/it] {'loss': 0.6171, 'learning_rate': 2.3270013157252747e-07, 'epoch': 0.93} 93%|█████████▎| 3273/3507 [1:21:22<05:19, 1.36s/it]tensor([[-5.3125, -5.3438, -2.0938, 1.5547, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.9375, -6.3438, -1.8828, 1.2891, -3.9531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:3') tensor([[-4.9062, -3.9531, 1.0859, 3.9375, -2.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.2812, -5.5312, -0.6211, 0.6797, -4.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7812, -1.5625, 3.1406, -1.1094, -5.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:06:09,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.97 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.7188, -1.2422, 2.9062, 0.1689, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8125, -5.3125, -0.9180, 2.6250, -2.8594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.7500, -2.0469, 2.9062, 0.0366, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:06:12,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.33 [2025-11-06 19:06:12,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.02 | bwd_microstep: 2002.75 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 2001.87 | step_microstep: 2.56 [2025-11-06 19:06:12,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 287.92 | bwd: 2003.64 | bwd_inner: 1.56 | bwd_allreduce: 2001.93 | step: 2.65 93%|█████████▎| 3274/3507 [1:21:26<08:01, 2.07s/it] {'loss': 0.2367, 'learning_rate': 2.3072315863493456e-07, 'epoch': 0.93} 93%|█████████▎| 3274/3507 [1:21:26<08:01, 2.07s/it]tensor([[-4.5000, -1.3672, 2.2812, -0.1836, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6875, -4.6562, -1.0625, 2.7500, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:0') [2025-11-06 19:06:12,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.44 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.6562, -4.8125, -0.4980, 2.1250, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.2812, -1.8438, 2.6875, -2.1406, -6.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.8750, -5.0312, -0.2324, 0.8359, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -2.7969, 2.0625, 2.6250, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.5938, -5.4688, -1.7734, 1.8984, -2.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2188, -0.1670, 2.4844, -2.0156, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:06:13,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 19:06:13,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.31 | bwd_microstep: 93.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 93.02 | step_microstep: 1.47 [2025-11-06 19:06:13,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 385.78 | bwd: 94.56 | bwd_inner: 1.37 | bwd_allreduce: 93.06 | step: 1.54 93%|█████████▎| 3275/3507 [1:21:27<06:11, 1.60s/it] {'loss': 0.2085, 'learning_rate': 2.287545215276943e-07, 'epoch': 0.93} 93%|█████████▎| 3275/3507 [1:21:27<06:11, 1.60s/it]tensor([[-5.9688, -3.5938, 1.5703, 1.4141, -4.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4219, -3.5000, -0.5391, 3.1562, -0.9727]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 
19:06:13,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.03 | bwd_microstep: 0.65 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.0625, -2.4375, 0.6562, 0.9844, -2.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.0000, -5.8438, 0.1826, 1.0312, -5.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5000, -3.4062, 0.3008, 2.2500, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.9258, 2.9688, 2.9688, -2.5938, -2.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-2.3438, -0.0913, 4.0938, 3.7031, -1.4453]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1562, 0.7070, 3.8750, -0.1201, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:06:13,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 19:06:13,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.29 | bwd_microstep: 203.32 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 202.51 | step_microstep: 1.70 [2025-11-06 19:06:13,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.34 | bwd: 203.97 | bwd_inner: 1.30 | bwd_allreduce: 202.55 | step: 1.78 93%|█████████▎| 3276/3507 [1:21:27<04:58, 1.29s/it] {'loss': 0.624, 'learning_rate': 2.2679422193058297e-07, 'epoch': 0.93} 93%|█████████▎| 3276/3507 [1:21:27<04:58, 1.29s/it]tensor([[-5.8438, -4.0000, 0.7695, 1.5625, -3.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3438, -1.8203, 3.5469, 0.6406, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5938, -4.3438, 0.5820, 2.8594, 
-3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:06:13,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.39 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.5938, -3.1250, 0.4512, 1.5469, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4219, 0.6445, 2.4844, -3.0625, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.6250, -4.5000, 1.8828, 2.9062, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4062, -1.2656, 3.4375, -0.8945, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3438, -0.5195, 2.6562, -1.3594, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:06:14,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 19:06:14,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.71 | bwd_microstep: 1.50 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.68 | step_microstep: 1.89 [2025-11-06 19:06:14,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.12 | bwd: 2.16 | bwd_inner: 1.31 | bwd_allreduce: 0.71 | step: 1.98 93%|█████████▎| 3277/3507 [1:21:28<03:59, 1.04s/it] {'loss': 0.5026, 'learning_rate': 2.2484226151626932e-07, 'epoch': 0.93} 93%|█████████▎| 3277/3507 [1:21:28<03:59, 1.04s/it]tensor([[-6.0938, -6.5000, -3.0156, 1.6953, -2.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:06:14,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.97 | bwd_microstep: 3.56 | bwd_inner_microstep: 3.42 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 
[debug-print residue elided: interleaved between the records below, stray print() calls emitted a 1x5 bfloat16 logits tensor and an integer label tensor for each forward pass on cuda:0-cuda:3; the grad_fn names were lost when the log was captured]

[2025-11-06 19:06:14,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.22 | optimizer_step: 0.19
[2025-11-06 19:06:14,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.80 | bwd_microstep: 294.81 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 293.99 | step_microstep: 1.98
[2025-11-06 19:06:14,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.81 | bwd: 298.37 | bwd_inner: 4.16 | bwd_allreduce: 294.05 | step: 2.07
93%|█████████▎| 3278/3507 [1:21:28<03:32, 1.08it/s] {'loss': 0.2918, 'learning_rate': 2.2289864195030097e-07, 'epoch': 0.93}
[2025-11-06 19:06:15,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 93.72 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[2025-11-06 19:06:17,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.19 | optimizer_step: 0.19
[2025-11-06 19:06:17,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.68 | bwd_microstep: 992.34 | bwd_inner_microstep: 4.06 | bwd_allreduce_microstep: 988.17 | step_microstep: 2.07
[2025-11-06 19:06:17,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 243.41 | bwd: 993.13 | bwd_inner: 4.74 | bwd_allreduce: 988.23 | step: 2.16
93%|█████████▎| 3279/3507 [1:21:31<05:17, 1.39s/it] {'loss': 0.1347, 'learning_rate': 2.2096336489111025e-07, 'epoch': 0.93}
[2025-11-06 19:06:18,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.01 | bwd_microstep: 0.64 | bwd_inner_microstep: 0.54 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 19:06:19,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.22 | optimizer_step: 0.20
[2025-11-06 19:06:19,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.09 | bwd_microstep: 1.61 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.72 | step_microstep: 1.78
[2025-11-06 19:06:19,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.11 | bwd: 2.25 | bwd_inner: 1.34 | bwd_allreduce: 0.76 | step: 1.86
94%|█████████▎| 3280/3507 [1:21:33<05:47, 1.53s/it] {'loss': 0.3107, 'learning_rate': 2.1903643199000846e-07, 'epoch': 0.94}
[2025-11-06 19:06:19,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.51 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 19:06:23,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.22 | optimizer_step: 0.31
[2025-11-06 19:06:23,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.46 | bwd_microstep: 3810.31 | bwd_inner_microstep: 8.71 | bwd_allreduce_microstep: 3801.49 | step_microstep: 2.40
[2025-11-06 19:06:23,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 370.00 | bwd: 3810.98 | bwd_inner: 9.28 | bwd_allreduce: 3801.55 | step: 2.48
94%|█████████▎| 3281/3507 [1:21:37<08:48, 2.34s/it] {'loss': 0.6183, 'learning_rate': 2.1711784489119146e-07, 'epoch': 0.94}
[2025-11-06 19:06:23,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.89 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[2025-11-06 19:06:23,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 19:06:23,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.16 | bwd_microstep: 137.85 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 136.77 | step_microstep: 1.64
[2025-11-06 19:06:23,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.07 | bwd: 138.94 | bwd_inner: 2.00 | bwd_allreduce: 136.81 | step: 1.73
94%|█████████▎| 3282/3507 [1:21:37<06:43, 1.79s/it] {'loss': 0.357, 'learning_rate': 2.1520760523173313e-07, 'epoch': 0.94}
[2025-11-06 19:06:24,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.08 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 19:06:26,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.20 | optimizer_step: 0.21
[2025-11-06 19:06:26,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.80 | bwd_microstep: 2409.93 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 2408.78 | step_microstep: 2.20
[2025-11-06 19:06:26,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 295.91 | bwd: 2410.79 | bwd_inner: 1.82 | bwd_allreduce: 2408.83 | step: 2.28
94%|█████████▎| 3283/3507 [1:21:40<07:45, 2.08s/it] {'loss': 1.0704, 'learning_rate': 2.133057146415829e-07, 'epoch': 0.94}
[2025-11-06 19:06:26,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.46 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
[2025-11-06 19:06:27,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 19:06:27,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.46 | bwd_microstep: 23.01 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 21.84 | step_microstep: 2.06
[2025-11-06 19:06:27,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.95 | bwd: 23.98 | bwd_inner: 1.96 | bwd_allreduce: 21.89 | step: 2.15
94%|█████████▎| 3284/3507 [1:21:40<05:52, 1.58s/it] {'loss': 0.5997, 'learning_rate': 2.114121747435649e-07, 'epoch': 0.94}
[2025-11-06 19:06:27,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.49 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[h264 @ 0x8225480] SEI type 0 size 64 truncated at 56
[h264 @ 0x820ec00] SEI type 0 size 64 truncated at 56
[h264 @ 0x820ec00] SEI type 0 size 64 truncated at 56
[2025-11-06 19:06:29,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.21 | optimizer_step: 0.20
[2025-11-06 19:06:29,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.43 | bwd_microstep: 2207.64 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 2206.67 | step_microstep: 2.27
[2025-11-06 19:06:29,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 299.94 | bwd: 2208.47 | bwd_inner: 1.61 | bwd_allreduce: 2206.72 | step: 2.35
94%|█████████▎| 3285/3507 [1:21:43<06:55, 1.87s/it] {'loss': 0.8914, 'learning_rate': 2.0952698715338226e-07, 'epoch': 0.94}
[2025-11-06 19:06:29,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.26 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
[2025-11-06 19:06:30,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.15 | optimizer_step: 0.17
[2025-11-06 19:06:30,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.18 | bwd_microstep: 84.26 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 82.97 | step_microstep: 1.87
[2025-11-06 19:06:30,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 320.46 | bwd: 85.18 | bwd_inner: 2.06 | bwd_allreduce: 83.00 | step: 1.94
94%|█████████▎| 3286/3507 [1:21:43<05:18, 1.44s/it] {'loss': 0.8981, 'learning_rate': 2.0765015347960716e-07, 'epoch': 0.94}
[2025-11-06 19:06:30,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 319.83 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 19:06:32,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.19 | optimizer_step: 0.29
[2025-11-06 19:06:32,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 328.82 | bwd_microstep: 1961.73 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 1960.87 | step_microstep: 2.05
[2025-11-06 19:06:32,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 648.68 | bwd: 1962.69 | bwd_inner: 1.63 | bwd_allreduce: 1960.91 | step: 2.13
94%|█████████▎| 3287/3507 [1:21:46<06:37, 1.81s/it] {'loss': 0.3142, 'learning_rate': 2.0578167532368742e-07, 'epoch': 0.94}
[2025-11-06 19:06:32,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.13 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
[2025-11-06 19:06:33,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 19:06:33,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.70 | bwd_microstep: 84.53 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 83.31 | step_microstep: 1.91
[2025-11-06 19:06:33,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.86 | bwd: 85.45 | bwd_inner: 1.96 | bwd_allreduce: 83.36 | step: 2.00
94%|█████████▍| 3288/3507 [1:21:47<05:10, 1.42s/it] {'loss': 0.4018, 'learning_rate': 2.0392155427993554e-07, 'epoch': 0.94}
[2025-11-06 19:06:33,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.01 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 19:06:35,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.62 | optimizer_gradients: 0.27 | optimizer_step: 0.30
[2025-11-06 19:06:35,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.73 | bwd_microstep: 2012.26 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2011.17 | step_microstep: 3.19
[2025-11-06 19:06:35,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.77 | bwd: 2013.08 | bwd_inner: 1.71 | bwd_allreduce: 2011.22 | step: 3.27
94%|█████████▍| 3289/3507 [1:21:49<06:15, 1.72s/it] {'loss': 0.6319, 'learning_rate': 2.0206979193554187e-07, 'epoch': 0.94}
[2025-11-06 19:06:35,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 106.66 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
[2025-11-06 19:06:36,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 19:06:36,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.18 | bwd_microstep: 205.30 | bwd_inner_microstep: 1.41 | bwd_allreduce_microstep: 203.80 | step_microstep: 1.57
[2025-11-06 19:06:36,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.86 | bwd: 206.26 | bwd_inner: 2.28 | bwd_allreduce: 203.85 | step: 1.65
94%|█████████▍| 3290/3507 [1:21:50<04:56, 1.37s/it] {'loss': 0.5867, 'learning_rate': 2.0022638987055698e-07, 'epoch': 0.94}
[2025-11-06 19:06:36,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.77 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
[2025-11-06 19:06:38,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.16 | optimizer_step: 0.16
[2025-11-06 19:06:38,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.99 | bwd_microstep: 2170.54 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 2169.20 | step_microstep: 1.94
[2025-11-06 19:06:38,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 406.78 | bwd: 2171.48 | bwd_inner: 2.11 | bwd_allreduce: 2169.23 | step: 1.98
94%|█████████▍| 3291/3507 [1:21:52<06:16, 1.74s/it] {'loss': 0.4264, 'learning_rate': 1.983913496578993e-07, 'epoch': 0.94}
[2025-11-06 19:06:39,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.66 | bwd_microstep: 1.14 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 19:06:39,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 19:06:39,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 303.72 | bwd_microstep: 1.87 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.70 | step_microstep: 1.42
[2025-11-06 19:06:39,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 494.40 | bwd: 3.01 | bwd_inner: 2.15 | bwd_allreduce: 0.74 | step: 1.51
94%|█████████▍| 3292/3507 [1:21:53<04:56, 1.38s/it] {'loss': 0.6637, 'learning_rate': 1.9656467286335523e-07, 'epoch': 0.94}
[2025-11-06 19:06:39,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.30 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
[2025-11-06 19:06:40,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.26
[2025-11-06 19:06:40,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.63 | bwd_microstep: 991.92 | bwd_inner_microstep: 4.88 | bwd_allreduce_microstep: 986.94 | step_microstep: 2.00
[2025-11-06 19:06:40,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.95 | bwd: 992.76 | bwd_inner: 5.64 | bwd_allreduce: 986.98 | step: 2.07
94%|█████████▍| 3293/3507 [1:21:54<04:53, 1.37s/it] {'loss': 0.201, 'learning_rate': 1.9474636104557244e-07, 'epoch': 0.94}
[2025-11-06 19:06:40,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.86 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 19:06:41,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 19:06:41,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 218.09 | bwd_microstep: 16.78 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 15.73 | step_microstep: 2.11
[2025-11-06 19:06:41,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 408.98 | bwd: 17.68 | bwd_inner: 1.78 | bwd_allreduce: 15.77 | step: 2.19
94%|█████████▍| 3294/3507 [1:21:55<03:53, 1.10s/it] {'loss': 0.4301, 'learning_rate': 1.9293641575606203e-07, 'epoch': 0.94}
[2025-11-06 19:06:41,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 105.67 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
[2025-11-06 19:06:43,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.17 | optimizer_step: 0.21
[2025-11-06 19:06:43,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.89 | bwd_microstep: 1752.29 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 1751.30 | step_microstep: 2.34
[2025-11-06 19:06:43,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 272.57 | bwd: 1753.24 | bwd_inner: 1.73 | bwd_allreduce: 1751.35 | step: 2.42
94%|█████████▍| 3295/3507 [1:21:57<04:54, 1.39s/it] {'loss': 0.6203, 'learning_rate': 1.9113483853919756e-07, 'epoch': 0.94}
[2025-11-06 19:06:44,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.53 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
[2025-11-06 19:06:44,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.25 | optimizer_step: 0.20
[2025-11-06 19:06:44,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 329.40 | bwd_microstep: 2.72 | bwd_inner_microstep: 1.45 | bwd_allreduce_microstep: 1.11 | step_microstep: 2.55
[2025-11-06 19:06:44,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 458.96 | bwd: 3.45 | bwd_inner: 2.09 | bwd_allreduce: 1.17 | step: 2.62
94%|█████████▍| 3296/3507 [1:21:58<04:59, 1.42s/it] {'loss': 0.443, 'learning_rate': 1.8934163093220715e-07, 'epoch': 0.94}
[2025-11-06 19:06:45,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.05 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
[2025-11-06 19:06:46,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.34 | optimizer_step: 0.24
[2025-11-06 19:06:46,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.40 | bwd_microstep: 1594.76 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 1593.39 | step_microstep: 2.77
[2025-11-06 19:06:46,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 417.45 | bwd: 1595.77 | bwd_inner: 2.12 | bwd_allreduce: 1593.46 | step: 2.89
94%|█████████▍| 3297/3507 [1:22:00<05:38, 1.61s/it] {'loss': 0.4433, 'learning_rate': 1.875567944651835e-07, 'epoch': 0.94}
[2025-11-06 19:06:47,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 285.94 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
[2025-11-06 19:06:47,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.27 | optimizer_step: 0.23
[2025-11-06 19:06:47,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.92 | bwd_microstep: 41.68 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 40.34 | step_microstep: 2.43
[2025-11-06 19:06:47,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 472.90 | bwd: 42.71 | bwd_inner: 2.09 | bwd_allreduce: 40.40 | step: 2.54
94%|█████████▍| 3298/3507 [1:22:01<04:32, 1.30s/it] {'loss': 0.153, 'learning_rate': 1.8578033066107392e-07, 'epoch': 0.94}
[2025-11-06 19:06:47,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.91 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([3], device='cuda:1') tensor([[-5.0312, -1.7969, 2.4844, 0.0452, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1875, -3.9375, 0.3418, 2.1719, -3.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.0156, -3.3906, -0.9141, 3.0781, -0.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1250, -1.4844, 2.1719, -1.1328, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:06:50,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.21 | optimizer_step: 0.30 [2025-11-06 19:06:50,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.12 | bwd_microstep: 2234.34 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 2233.06 | step_microstep: 2.26 [2025-11-06 19:06:50,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.06 | bwd: 2235.24 | bwd_inner: 1.93 | bwd_allreduce: 2233.13 | step: 2.36 94%|█████████▍| 3299/3507 [1:22:03<05:53, 1.70s/it] {'loss': 0.1347, 'learning_rate': 1.8401224103568038e-07, 'epoch': 0.94} 94%|█████████▍| 3299/3507 [1:22:03<05:53, 1.70s/it]tensor([[-2.6094, 1.0312, 1.7969, -2.8438, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.0938, -1.2422, 1.9766, -1.8828, -4.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:06:50,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.43 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-3.9219, -4.4062, -1.6250, 2.9375, -1.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.7812, -5.5625, -1.3828, 0.9023, -4.1562]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8125, -2.6562, 1.5547, 1.7578, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -3.8125, 0.5195, 2.4844, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.8438, -6.8750, -2.1250, 0.4727, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-8.1250, -6.4688, -0.4199, 1.4219, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:06:50,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 19:06:50,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 283.96 | bwd_microstep: 1.86 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.86 | step_microstep: 1.45 [2025-11-06 19:06:50,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 462.43 | bwd: 2.82 | bwd_inner: 1.74 | bwd_allreduce: 0.92 | step: 1.55 94%|█████████▍| 3300/3507 [1:22:04<04:37, 1.34s/it] {'loss': 1.0795, 'learning_rate': 1.822525270976605e-07, 'epoch': 0.94} 94%|█████████▍| 3300/3507 [1:22:04<04:37, 1.34s/it]tensor([[-0.4844, 2.6875, 2.6875, -1.2500, -1.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:06:50,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 122.46 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.12 tensor([[-6.4688, -5.2812, -0.2520, 2.2188, -3.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1250, -2.2344, -0.6133, 0.6328, -1.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4688, -2.6250, 1.1250, -0.8672, -4.5312]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.3594, 0.4395, 1.6641, -2.9844, -3.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.3125, -2.9688, 2.1562, 0.2930, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4062, -5.9688, -1.6016, 2.0781, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1875, -5.6875, -2.5469, 2.3594, -1.9766]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:06:51,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.36 | optimizer_step: 0.30 [2025-11-06 19:06:51,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.95 | bwd_microstep: 621.77 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 620.58 | step_microstep: 3.17 [2025-11-06 19:06:51,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 307.43 | bwd: 622.80 | bwd_inner: 1.97 | bwd_allreduce: 620.66 | step: 3.30 94%|█████████▍| 3301/3507 [1:22:05<04:13, 1.23s/it] {'loss': 0.6878, 'learning_rate': 1.8050119034852765e-07, 'epoch': 0.94} 94%|█████████▍| 3301/3507 [1:22:05<04:13, 1.23s/it]tensor([[-3.1875, -4.4062, -2.6719, 2.4844, -0.3184]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6719, -0.8047, 0.8047, -1.9141, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.9062, -3.7344, 2.3594, 1.2656, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.6445, 2.9844, 2.3750, -2.4688, -1.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-7.3125, -5.5938, 0.0811, 1.6172, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -4.4375, -0.0210, 
2.8281, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.3125, -4.8125, 1.3828, 1.6719, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:06:53,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 299.85 | bwd_microstep: 1.11 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.7188, -4.7500, -1.3750, 2.3438, -1.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:06:54,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 19:06:54,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 243.92 | bwd_microstep: 2.38 | bwd_inner_microstep: 1.40 | bwd_allreduce_microstep: 0.90 | step_microstep: 2.54 [2025-11-06 19:06:54,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 543.80 | bwd: 3.48 | bwd_inner: 2.41 | bwd_allreduce: 0.93 | step: 2.64 94%|█████████▍| 3302/3507 [1:22:07<05:34, 1.63s/it] {'loss': 0.4432, 'learning_rate': 1.787582322826431e-07, 'epoch': 0.94} 94%|█████████▍| 3302/3507 [1:22:07<05:34, 1.63s/it]tensor([[-6.3750, -2.1875, 2.7969, -1.2031, -5.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.1562, -4.9375, 1.0938, 1.7500, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5312, -4.9375, -2.0312, 2.3906, -1.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.4062, -2.1562, 3.3438, -0.5820, -5.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:06:54,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.46 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 
0.08 tensor([[-1.9688, 1.8828, 2.3906, -2.6562, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3750, -1.2656, 3.4062, -0.5195, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4375, -1.6875, 1.5781, 0.1309, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.1875, -4.3750, -0.1025, 2.8438, -2.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:06:55,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.20 | optimizer_step: 0.18 [2025-11-06 19:06:55,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.23 | bwd_microstep: 461.95 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 460.67 | step_microstep: 2.11 [2025-11-06 19:06:55,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 431.71 | bwd: 462.89 | bwd_inner: 2.04 | bwd_allreduce: 460.71 | step: 2.20 94%|█████████▍| 3303/3507 [1:22:08<04:50, 1.42s/it] {'loss': 0.346, 'learning_rate': 1.7702365438722058e-07, 'epoch': 0.94} 94%|█████████▍| 3303/3507 [1:22:08<04:50, 1.42s/it]tensor([[-1.5625, 1.6406, 1.7188, -1.6641, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6250, -3.5781, 0.4375, 2.6406, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.8438, -0.0100, 3.8594, -0.0410, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8750, -5.9375, -1.6641, 3.0000, -2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -3.5469, 0.8633, 1.8047, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5625, -3.7656, 1.1562, 0.3594, -4.9375]], device='cuda:3', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8125, -3.8125, 1.0391, 1.6094, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:06:56,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.61 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-3.5469, 0.1650, 2.2188, -1.8047, -3.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:06:56,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 19:06:56,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.38 | bwd_microstep: 2.16 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 0.85 | step_microstep: 2.36 [2025-11-06 19:06:56,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.01 | bwd: 3.19 | bwd_inner: 2.15 | bwd_allreduce: 0.89 | step: 2.46 94%|█████████▍| 3304/3507 [1:22:10<05:07, 1.52s/it] {'loss': 0.6478, 'learning_rate': 1.7529745814232168e-07, 'epoch': 0.94} 94%|█████████▍| 3304/3507 [1:22:10<05:07, 1.52s/it]tensor([[-3.8594, 0.8438, 3.4688, -2.7188, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2031, -0.2148, 2.0938, -0.7578, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2969, -0.1914, 2.3125, -0.7148, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:06:56,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.46 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.1875, -6.0000, -1.4844, 0.2451, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.1875, -4.5625, 1.5234, 1.4609, -5.1250]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.2109, 2.8281, 3.8125, -1.5938, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.5625, -6.3438, -0.2002, 2.8906, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5000, -2.5625, 1.0859, 1.2578, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:06:57,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 19:06:57,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.68 | bwd_microstep: 370.65 | bwd_inner_microstep: 1.40 | bwd_allreduce_microstep: 369.16 | step_microstep: 1.51 [2025-11-06 19:06:57,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 365.16 | bwd: 371.46 | bwd_inner: 2.13 | bwd_allreduce: 369.20 | step: 1.60 94%|█████████▍| 3305/3507 [1:22:11<04:21, 1.29s/it] {'loss': 0.304, 'learning_rate': 1.7357964502086155e-07, 'epoch': 0.94} 94%|█████████▍| 3305/3507 [1:22:11<04:21, 1.29s/it]tensor([[-4.1875, -0.1582, 2.1406, -2.5625, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1562, -5.4688, -0.9102, 2.2188, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9062, -1.4375, 1.8438, -1.1953, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-8.2500, -7.2812, -1.9141, 1.2188, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.2500, -2.8906, 2.5938, 0.7891, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.6680, 2.6875, 2.7188, -1.3281, -1.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.7500, 
-3.2344, 0.7852, 1.9297, -2.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:00,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.05 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.9688, -2.0781, 2.1250, 0.6641, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:00,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 19:07:00,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.80 | bwd_microstep: 1.94 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.08 [2025-11-06 19:07:00,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 379.87 | bwd: 2.92 | bwd_inner: 1.93 | bwd_allreduce: 0.86 | step: 2.17 94%|█████████▍| 3306/3507 [1:22:14<05:51, 1.75s/it] {'loss': 0.2115, 'learning_rate': 1.718702164885966e-07, 'epoch': 0.94} 94%|█████████▍| 3306/3507 [1:22:14<05:51, 1.75s/it]tensor([[-3.8125, -4.4062, -1.4766, 3.2188, -0.9102]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8750, -4.8750, -0.9141, 3.3750, -1.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:00,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.63 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.0156, 1.0703, 2.8750, -2.0000, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.2812, -1.4922, 2.7031, -0.5859, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7500, -4.0625, 1.1484, 2.5469, -3.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:1') tensor([[-3.4688, 0.2969, 3.9375, 0.2246, -3.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -0.1025, 3.4062, -2.2344, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2812, -1.6094, 2.5000, -1.0312, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:07:00,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 19:07:00,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 319.45 | bwd_microstep: 19.50 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 18.30 | step_microstep: 1.53 [2025-11-06 19:07:00,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 470.11 | bwd: 20.37 | bwd_inner: 1.91 | bwd_allreduce: 18.34 | step: 1.63 94%|█████████▍| 3307/3507 [1:22:14<04:36, 1.38s/it] {'loss': 0.4776, 'learning_rate': 1.7016917400413002e-07, 'epoch': 0.94} 94%|█████████▍| 3307/3507 [1:22:14<04:36, 1.38s/it]tensor([[-4.4062, -4.5312, -1.5703, 2.1562, -1.7578]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7812, -3.7812, 0.8008, 1.0938, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.4062, -2.7031, -0.5391, 3.0781, -0.2109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.6562, -3.7344, 1.7578, 0.5898, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -1.1328, 3.0781, -0.9609, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0312, -5.6250, 0.0537, 2.3906, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2812, -4.7500, -1.5625, 3.1875, -1.2422]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:01,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.71 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.1875, -1.4375, 1.7812, -2.0469, -4.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:01,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 19:07:01,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.61 | bwd_microstep: 1.86 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.90 | step_microstep: 2.05 [2025-11-06 19:07:01,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.32 | bwd: 2.77 | bwd_inner: 1.66 | bwd_allreduce: 0.94 | step: 2.14 94%|█████████▍| 3308/3507 [1:22:15<03:51, 1.16s/it] {'loss': 0.1824, 'learning_rate': 1.6847651901891081e-07, 'epoch': 0.94} 94%|█████████▍| 3308/3507 [1:22:15<03:51, 1.16s/it]tensor([[-5.7188, -1.5781, 3.5312, -0.7578, -5.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:01,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.18 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.7188, -2.0156, 1.8359, 0.1279, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -5.1562, -2.4375, 2.2656, -1.5547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8125, -3.9219, -0.2402, 2.0469, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0312, -2.5781, 1.1641, 0.3125, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8438, -3.9375, 
0.2080, 0.5508, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2500, -1.8047, 1.3672, 0.3262, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0625, -1.0156, 3.1875, -1.1797, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:07:03,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.20 | optimizer_step: 0.28 [2025-11-06 19:07:03,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.21 | bwd_microstep: 1137.22 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1135.90 | step_microstep: 2.05 [2025-11-06 19:07:03,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.43 | bwd: 1137.91 | bwd_inner: 1.80 | bwd_allreduce: 1135.94 | step: 2.13 94%|█████████▍| 3309/3507 [1:22:16<04:12, 1.27s/it] {'loss': 0.2061, 'learning_rate': 1.6679225297723146e-07, 'epoch': 0.94} 94%|█████████▍| 3309/3507 [1:22:16<04:12, 1.27s/it]tensor([[-6.5312, -5.3438, -0.3301, 2.0469, -3.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4375, -2.2656, 3.3438, -0.5547, -5.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4375, -2.6250, 0.5586, 2.7031, -1.5156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -5.0625, -1.5469, 2.3906, -2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0625, -4.2812, 0.3457, 3.3281, -2.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6562, -2.1250, 3.2500, 0.4609, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9531, -1.0781, 1.6094, -0.3359, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:0') [2025-11-06 19:07:04,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.30 | bwd_microstep: 1.09 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.5312, -2.5469, 1.6641, -0.4121, -4.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:04,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.78 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 19:07:04,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.31 | bwd_microstep: 1.88 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.88 | step_microstep: 2.74 [2025-11-06 19:07:04,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.63 | bwd: 2.96 | bwd_inner: 1.92 | bwd_allreduce: 0.91 | step: 2.81 94%|█████████▍| 3310/3507 [1:22:18<04:09, 1.26s/it] {'loss': 0.0909, 'learning_rate': 1.6511637731622453e-07, 'epoch': 0.94} 94%|█████████▍| 3310/3507 [1:22:18<04:09, 1.26s/it]tensor([[-8.5000, -6.3750, -1.5469, -1.3359, -6.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-12.3750, -9.0625, -1.2656, -1.5938, -9.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:04,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.18 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.0312, -4.2500, 0.4551, 1.5078, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-8.8750, -8.1875, -3.4375, -0.1147, -5.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.6719, 2.5469, 3.6875, -2.3438, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6250, -5.0625, -1.5703, 3.3750, -1.4844]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9062, 0.9609, 3.6719, -0.5117, -3.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9375, -3.3438, -0.1592, 2.2812, -1.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:07:05,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 19:07:05,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.31 | bwd_microstep: 443.91 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 442.88 | step_microstep: 1.94 [2025-11-06 19:07:05,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 309.51 | bwd: 444.80 | bwd_inner: 1.74 | bwd_allreduce: 442.92 | step: 2.01 94%|█████████▍| 3311/3507 [1:22:19<04:19, 1.32s/it] {'loss': 0.4458, 'learning_rate': 1.6344889346586402e-07, 'epoch': 0.94} 94%|█████████▍| 3311/3507 [1:22:19<04:19, 1.32s/it]tensor([[-3.1719, -1.7266, 0.9141, 1.5078, -1.8828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.5625, -7.0625, -0.9062, 1.2969, -5.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0625, -1.0625, 3.3906, -0.7148, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0938, -3.6719, 1.6328, 1.4453, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.2969, 1.6094, 4.3438, -0.1475, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.2812, -3.2656, 2.2500, 0.9453, -4.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -0.9609, 2.3906, -0.3535, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:07,108] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.09 | bwd_microstep: 1.33 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 tensor([[-5.5000, -4.1250, 0.6328, 2.5781, -3.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:07,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.75 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 19:07:07,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.11 | bwd_microstep: 2.06 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.92 | step_microstep: 2.91 [2025-11-06 19:07:07,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.22 | bwd: 3.39 | bwd_inner: 2.18 | bwd_allreduce: 0.99 | step: 3.05 94%|█████████▍| 3312/3507 [1:22:21<04:30, 1.39s/it] {'loss': 0.2238, 'learning_rate': 1.6178980284896507e-07, 'epoch': 0.94} 94%|█████████▍| 3312/3507 [1:22:21<04:30, 1.39s/it]tensor([[-2.7500, 1.0234, 3.2500, -1.4453, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7188, -2.0781, 1.3906, -2.1719, -5.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:07,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.90 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-4.9062, -3.7188, 1.0156, 3.3906, -2.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.7969, -3.1250, -2.0156, 3.1875, 0.8086]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.3125, -3.0625, 1.1797, 0.5664, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4062, -3.2969, 0.7422, 2.7031, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:1') tensor([[-7.4062, -6.0938, -1.2734, 0.7344, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.9766, 1.2344, 2.0625, -1.9062, -2.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') [2025-11-06 19:07:09,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 19:07:09,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 129.52 | bwd_microstep: 1530.90 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1529.70 | step_microstep: 1.71 [2025-11-06 19:07:09,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.44 | bwd: 1531.64 | bwd_inner: 1.79 | bwd_allreduce: 1529.73 | step: 1.78 94%|█████████▍| 3313/3507 [1:22:23<04:57, 1.54s/it] {'loss': 0.2882, 'learning_rate': 1.6013910688117972e-07, 'epoch': 0.94} 94%|█████████▍| 3313/3507 [1:22:23<04:57, 1.54s/it]tensor([[-5.9688, -3.9219, 1.5781, 2.2031, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1562, -1.8438, 1.9609, -0.7773, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -5.0312, -1.4922, 2.5625, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7812, -4.0938, 0.0243, 1.2109, -3.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0000, -3.9844, 0.3555, 2.8438, -2.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.0312, -6.4375, -1.7266, 1.7031, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6406, 0.0771, 3.3281, -0.4512, -3.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:11,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 83.64 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-6.0938, -2.0625, 1.8516, -2.2656, -5.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:07:11,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.21 | optimizer_step: 0.18
[2025-11-06 19:07:11,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.10 | bwd_microstep: 1.93 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.91 | step_microstep: 2.38
[2025-11-06 19:07:11,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 393.76 | bwd: 2.68 | bwd_inner: 1.56 | bwd_allreduce: 0.96 | step: 2.47
 94%|█████████▍| 3314/3507 [1:22:25<05:32, 1.72s/it] {'loss': 0.7846, 'learning_rate': 1.584968069709958e-07, 'epoch': 0.94}
tensor([[-6.6250, -6.2188, -1.2344, 2.8125, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5312, -2.9844, 0.3770, 1.0312, -2.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.4688, -3.9688, 0.7969, 2.1094, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:07:11,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.29 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07
tensor([[-3.9688, -3.8438, -0.3047, 3.7031, -1.2422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-6.3125, -4.0000, -0.4121, -0.9062, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.7188, -0.3770, 3.5781, -1.5391, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.5312, -3.2031, 0.7617,
2.4844, -2.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.8438, -6.1875, -2.5156, 2.1094, -2.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:07:13,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 19:07:13,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.45 | bwd_microstep: 1987.21 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 1986.16 | step_microstep: 1.75
[2025-11-06 19:07:13,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.77 | bwd: 1987.98 | bwd_inner: 1.66 | bwd_allreduce: 1986.19 | step: 1.82
 95%|█████████▍| 3315/3507 [1:22:27<06:11, 1.93s/it] {'loss': 0.2641, 'learning_rate': 1.5686290451974007e-07, 'epoch': 0.95}
tensor([[-2.7656, -2.7031, 0.5039, 4.1562, -0.4258]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-0.2227, 2.5625, 4.6562, 2.1406, -0.6016]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:07:13,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.91 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.3125, -3.9688, -1.1016, -0.2754, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.0625, -4.1562, -0.1865, 1.9688, -2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-5.6562, -3.0469, 0.9219, -0.0452, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.0938, -3.9844, -1.3750, 1.8438, -1.7266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.1250, -3.6719, 0.4180, 2.0000, -3.0312]],
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1875, -4.5938, -0.8086, 2.2656, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:07:14,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.14 | optimizer_step: 0.17
[2025-11-06 19:07:14,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.40 | bwd_microstep: 27.50 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 26.21 | step_microstep: 1.52
[2025-11-06 19:07:14,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 362.34 | bwd: 28.38 | bwd_inner: 2.01 | bwd_allreduce: 26.24 | step: 1.59
 95%|█████████▍| 3316/3507 [1:22:28<04:42, 1.48s/it] {'loss': 0.9445, 'learning_rate': 1.5523740092157068e-07, 'epoch': 0.95}
tensor([[-5.3125, -4.0312, 0.8086, 3.1406, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.6250, -4.0625, 0.7070, 2.0781, -3.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.0625, -4.0938, -0.5586, 3.1719, -1.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:07:14,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 281.09 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.1562, 1.9453, 3.0312, -2.5781, -3.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2')
tensor([[-2.8281, 0.4023, 2.6406, -1.2188, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5312, -3.9062, 0.6016, 1.8750, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-5.2188, -2.6406, 2.1562, 1.8281, -3.7188]], device='cuda:1',
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.6250, -4.3750, -0.0337, 2.0312, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:07:16,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.21 | optimizer_step: 0.33
[2025-11-06 19:07:16,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.62 | bwd_microstep: 2008.51 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2007.40 | step_microstep: 2.50
[2025-11-06 19:07:16,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 475.75 | bwd: 2009.36 | bwd_inner: 1.76 | bwd_allreduce: 2007.44 | step: 2.58
 95%|█████████▍| 3317/3507 [1:22:30<05:40, 1.79s/it] {'loss': 0.641, 'learning_rate': 1.5362029756348373e-07, 'epoch': 0.95}
tensor([[-3.4375, -0.7266, 3.2344, 1.4297, -2.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.8281, 1.3125, 2.6875, -2.8594, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.2500, -4.6562, -1.7578, 2.7969, -1.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:07:16,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.36 | bwd_microstep: 1.06 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.2812, -1.5000, 2.9531, -0.4258, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.8750, -5.5938, -0.8906, 3.1875, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.6875, -6.5312, -0.7500, 2.0625, -4.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.1719, 1.0391, 2.3750, -1.2344, -2.6250]], device='cuda:0', dtype=torch.bfloat16,
grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.7188, -0.3184, 2.5781, -0.3027, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 19:07:17,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 19:07:17,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.73 | bwd_microstep: 97.95 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 96.78 | step_microstep: 1.53
[2025-11-06 19:07:17,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.11 | bwd: 99.01 | bwd_inner: 2.05 | bwd_allreduce: 96.82 | step: 1.62
 95%|█████████▍| 3318/3507 [1:22:31<04:23, 1.40s/it] {'loss': 0.4782, 'learning_rate': 1.5201159582530323e-07, 'epoch': 0.95}
tensor([[-7.2500e+00, -4.0938e+00, 1.4219e+00, 1.6937e-03, -5.6875e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:07:17,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.15 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.5469, -2.0312, 0.8203, 1.1641, -2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.1250, -0.1196, 3.4375, -1.0703, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.6562, -3.4844, 0.5547, 2.1250, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.1250, -1.4609, 2.1719, 1.0234, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.0000, -5.0625, -0.5430, 2.0625, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.5312, -5.6562, -2.0469, 2.3281, -2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:1')
tensor([[-4.6875, -1.5859, 1.9453, -0.2676, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:07:20,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.18 | optimizer_step: 0.19
[2025-11-06 19:07:20,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.88 | bwd_microstep: 2461.84 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2460.77 | step_microstep: 2.26
[2025-11-06 19:07:20,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.04 | bwd: 2462.85 | bwd_inner: 1.91 | bwd_allreduce: 2460.81 | step: 2.34
 95%|█████████▍| 3319/3507 [1:22:33<05:41, 1.82s/it] {'loss': 0.3934, 'learning_rate': 1.504112970796856e-07, 'epoch': 0.95}
tensor([[-4.0000, -3.7812, -0.3047, 3.4688, -1.3516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0000, -3.6094, 0.4199, 1.6797, -3.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:07:20,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.70 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.7188, -3.7656, 0.1240, 2.4531, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.3750, -4.5938, -0.7578, 1.8438, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0938, -2.8125, 1.7109, 1.4609, -3.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.9219, -2.5938, 0.5273, 1.6016, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1250, -5.1562, -1.6172, 2.5938, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3],
device='cuda:1')
tensor([[-5.8750, -2.0156, 2.5000, -1.1641, -5.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 19:07:20,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.18 | optimizer_step: 0.15
[2025-11-06 19:07:20,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 47.91 | bwd_microstep: 315.46 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 314.49 | step_microstep: 1.66
[2025-11-06 19:07:20,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 203.62 | bwd: 316.33 | bwd_inner: 1.67 | bwd_allreduce: 314.53 | step: 1.73
 95%|█████████▍| 3320/3507 [1:22:34<04:28, 1.44s/it] {'loss': 0.6534, 'learning_rate': 1.4881940269211637e-07, 'epoch': 0.95}
tensor([[-7.7812, -5.1562, 0.8125, 0.6094, -5.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.3750, -4.9688, -0.8516, 2.7812, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5312, -4.1562, -0.6328, 2.7500, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:07:20,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.44 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-6.3438, -4.4375, 0.2217, 1.0781, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.6562, -1.4609, 2.9375, -1.5000, -5.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.5625, -2.6250, 1.7891, 0.3457, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.0156, 1.6953, 3.0000, -1.7891, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
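(Note: the `[Rank 0] time (ms) | fwd_microstep ... | bwd_allreduce_microstep ...` lines interleaved above come from DeepSpeed's wall-clock timers. A minimal sketch of the config fragment that enables them, assuming the run already has a DeepSpeed JSON config; the other keys here are illustrative placeholders, not taken from this run:)

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": true
}
```

(The recurring multi-second `bwd_allreduce` spikes in this log are where gradient all-reduce waits on the slowest rank; with `wall_clock_breakdown` disabled, those lines disappear but the stalls remain.)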
tensor([[-6.9688, -3.6094, 2.4688, 0.8242, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:07:22,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 19:07:22,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.54 | bwd_microstep: 1923.64 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 1922.51 | step_microstep: 2.04
[2025-11-06 19:07:22,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.01 | bwd: 1924.42 | bwd_inner: 1.73 | bwd_allreduce: 1922.56 | step: 2.11
 95%|█████████▍| 3321/3507 [1:22:36<05:18, 1.71s/it] {'loss': 0.2397, 'learning_rate': 1.4723591402091453e-07, 'epoch': 0.95}
tensor([[-3.2500, -4.4375, -3.2344, 1.2969, -0.6016]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 19:07:23,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.75 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-2.6094, 0.8672, 2.4375, -1.1016, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.6875, -4.4688, -0.2891, 1.3594, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.6719, 1.5156, 4.3750, -0.5391, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.2500, -3.5938, 1.4922, 3.0938, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.3750, -5.1875, 0.7852, 1.8906, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.0000, -5.2500, 0.1738, 1.5391, -4.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.1562, 1.7266,
3.4844, -1.4453, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 19:07:23,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.88 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 19:07:23,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.55 | bwd_microstep: 126.05 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 125.00 | step_microstep: 2.52
[2025-11-06 19:07:23,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 319.31 | bwd: 127.00 | bwd_inner: 1.83 | bwd_allreduce: 125.03 | step: 2.60
 95%|█████████▍| 3322/3507 [1:22:37<04:08, 1.34s/it] {'loss': 0.8282, 'learning_rate': 1.4566083241722262e-07, 'epoch': 0.95}
tensor([[-6.1562, -2.7500, 2.6719, 0.3086, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-0.3086, 2.3281, 4.7812, 2.3125, -0.6445]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8750, -0.8359, 4.0000, 0.0308, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:07:23,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.39 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-6.5000, -5.8125, -0.6797, 2.6250, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9062, -4.9375, -1.0156, 3.1875, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.4375, -5.3438, 0.2598, 1.1797, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-9.4375, -6.1250, 0.3887, -0.8750, -7.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.6562, -3.8750, -0.4590, 3.6719, -1.0078]],
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:07:26,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.28 | optimizer_step: 0.34 [2025-11-06 19:07:26,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.24 | bwd_microstep: 1864.47 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 1863.44 | step_microstep: 237.09 [2025-11-06 19:07:26,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 524.66 | bwd: 1865.40 | bwd_inner: 1.75 | bwd_allreduce: 1863.49 | step: 237.18 95%|█████████▍| 3323/3507 [1:22:39<05:20, 1.74s/it] {'loss': 0.2352, 'learning_rate': 1.4409415922500892e-07, 'epoch': 0.95} 95%|█████████▍| 3323/3507 [1:22:39<05:20, 1.74s/it]tensor([[-4.7500, -3.9375, 0.3027, 3.0625, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2188, -2.7656, 1.4766, 1.0156, -3.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2500, -4.7812, -2.1562, 2.1406, -1.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:26,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.43 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-7.4375, -5.6250, 0.6211, 2.2969, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6250, -3.5312, 0.5352, 4.5312, -1.0078]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.5938, -5.5000, -0.4805, -0.0757, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8750, -1.4922, 3.2500, 0.8633, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4375, -3.4844, 1.4609, 1.9062, -3.6094]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:07:26,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.13 | optimizer_step: 0.16
[2025-11-06 19:07:26,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.42 | bwd_microstep: 60.52 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 59.39 | step_microstep: 1.72
[2025-11-06 19:07:26,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 402.88 | bwd: 61.48 | bwd_inner: 1.92 | bwd_allreduce: 59.43 | step: 1.79
 95%|█████████▍| 3324/3507 [1:22:40<04:10, 1.37s/it] {'loss': 0.2955, 'learning_rate': 1.4253589578106853e-07, 'epoch': 0.95}
tensor([[-3.8438, -4.6250, -2.7188, 1.5547, -1.1328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.9375, -1.5938, 2.0000, -0.7617, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:07:26,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.96 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-8.3750, -7.0625, -1.7266, 0.4023, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8438, -3.7500, 0.6055, 2.8438, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.0312, -3.8906, -1.0156, 2.4844, -1.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.8750, -3.2656, -0.2178, 4.2500, -0.3164]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[h264 @ 0x87c9f40] mmco: unref short failure
[h264 @ 0x87c9f40] mmco: unref short failure
tensor([[-5.3125, -4.1250, 0.1270, 2.1875, -3.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.4688, -4.4375, -0.6094, 3.2812, -1.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:07:29,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 19:07:29,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.55 | bwd_microstep: 2836.84 | bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 2835.75 | step_microstep: 2.17
[2025-11-06 19:07:29,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.55 | bwd: 2837.64 | bwd_inner: 1.73 | bwd_allreduce: 2835.78 | step: 2.24
 95%|█████████▍| 3325/3507 [1:22:43<05:50, 1.93s/it] {'loss': 0.3577, 'learning_rate': 1.409860434150223e-07, 'epoch': 0.95}
tensor([[-2.6250, 0.2910, 3.2969, 0.4316, -2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:07:29,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.23 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-4.0625, -4.7500, -2.4531, 1.9766, -1.2266]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3')
tensor([[-4.8750, -3.5000, 0.6094, 1.8516, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.1875, -0.4004, 2.7656, -1.4688, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.9062, -0.4902, 2.8281, -0.4902, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.1875, -1.0469, 2.2031, -0.2559, -3.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.6094, -0.3711, 1.8750, -1.6562, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.2188,
-4.2500, -1.7500, 1.8594, -1.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 19:07:30,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 19:07:30,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.73 | bwd_microstep: 92.23 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 91.03 | step_microstep: 1.81
[2025-11-06 19:07:30,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 339.99 | bwd: 93.07 | bwd_inner: 1.88 | bwd_allreduce: 91.07 | step: 1.88
 95%|█████████▍| 3326/3507 [1:22:44<04:29, 1.49s/it] {'loss': 0.5081, 'learning_rate': 1.3944460344931133e-07, 'epoch': 0.95}
tensor([[-5.0000, -4.5000, -0.4238, 2.2969, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:07:30,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.03 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.0938, -4.2812, -0.2129, 0.5117, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-6.0312, -2.8906, 2.4219, 0.9414, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.6562, -3.0312, 2.0156, 1.3203, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-6.7812, -5.5000, -0.9766, 1.1562, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.9766, -0.7031, 2.6875, 4.2188, -0.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7656, 0.1553, 2.7812, -1.5781, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.5000, -3.3125, 0.7031, 2.6562,
-2.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:07:32,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.19 | optimizer_step: 0.21
[2025-11-06 19:07:32,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.39 | bwd_microstep: 1944.41 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 1943.26 | step_microstep: 2.09
[2025-11-06 19:07:32,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 375.45 | bwd: 1945.37 | bwd_inner: 1.92 | bwd_allreduce: 1943.31 | step: 2.18
 95%|█████████▍| 3327/3507 [1:22:46<05:14, 1.75s/it] {'loss': 0.4509, 'learning_rate': 1.3791157719920124e-07, 'epoch': 0.95}
tensor([[-1.9531, 1.9531, 3.3750, -1.8828, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:07:32,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 81.76 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-6.2188, -2.5000, 1.8438, -1.4609, -5.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.2969, 0.2197, 2.7656, -1.2344, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.6562, -4.0625, -0.5508, 2.0312, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.7812, -3.7656, -2.7188, 1.6641, -0.2363]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.0938, -5.9688, -0.6094, 2.1719, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-8.1250, -5.8438, 0.7852, 1.6406, -5.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1250, -3.3594, 0.6484, 1.6484, -3.2500]],
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:07:33,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.35 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 19:07:33,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.00 | bwd_microstep: 278.37 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 277.22 | step_microstep: 1.63
[2025-11-06 19:07:33,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 238.78 | bwd: 279.36 | bwd_inner: 1.98 | bwd_allreduce: 277.26 | step: 1.70
 95%|█████████▍| 3328/3507 [1:22:46<04:08, 1.39s/it] {'loss': 0.2776, 'learning_rate': 1.3638696597277678e-07, 'epoch': 0.95}
tensor([[-5.4375, -2.4844, 1.8203, 0.3008, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:07:33,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 132.89 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.9688, -5.4375, -0.4414, 3.4531, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6250, -2.7656, 1.0547, 1.4297, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-2.6562, -1.7891, 1.1328, 3.0000, -1.0391]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.8125, -2.1562, 1.8359, 0.6953, -3.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8125, -3.8594, 0.9102, 3.6250, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.3750, -4.1875, 0.1553, 1.9922, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.9688, -2.3906, 2.5156, -0.4082, -5.1562]], device='cuda:1',
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:07:35,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.19 | optimizer_step: 0.25
[2025-11-06 19:07:35,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.88 | bwd_microstep: 1752.42 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1751.13 | step_microstep: 2.38
[2025-11-06 19:07:35,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 256.79 | bwd: 1753.30 | bwd_inner: 1.97 | bwd_allreduce: 1751.18 | step: 2.46
 95%|█████████▍| 3329/3507 [1:22:49<04:41, 1.58s/it] {'loss': 0.57, 'learning_rate': 1.3487077107094182e-07, 'epoch': 0.95}
tensor([[-3.0312, 0.4785, 2.8281, -0.8711, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.4766, -0.8516, 1.0859, 3.0000, -0.1318]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-4.6562, -4.1875, -0.2480, 2.8906, -2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:07:35,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.21 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.2344, 0.7070, 1.8047, -3.4531, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-7.0312, -6.2500, -1.1562, 2.1250, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.8125, -4.1875, 0.7852, 2.2812, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-6.2500, -4.7188, 0.7578, 2.9688, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.1875, -5.5000, -0.2275, 3.4062, -3.0625]], device='cuda:3', dtype=torch.bfloat16,
grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:07:35,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.14 | optimizer_step: 0.16
[2025-11-06 19:07:35,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.67 | bwd_microstep: 9.80 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 8.64 | step_microstep: 1.52
[2025-11-06 19:07:35,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 347.91 | bwd: 10.65 | bwd_inner: 1.85 | bwd_allreduce: 8.67 | step: 1.60
 95%|█████████▍| 3330/3507 [1:22:49<03:37, 1.23s/it] {'loss': 0.799, 'learning_rate': 1.3336299378742147e-07, 'epoch': 0.95}
tensor([[-5.9375, -2.4219, 1.3984, -1.8047, -5.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.0625, -5.0938, -1.6641, 2.4844, -2.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-11.1875, -9.6250, -2.9531, -0.1611, -7.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:07:35,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.72 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-4.0312, -0.9961, 2.5312, 0.5625, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-6.2188, -3.4844, 1.9219, 1.2031, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-9.2500, -5.8125, -2.7812, -5.5000, -7.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.9062, -4.7812, 0.0674, 2.5781, -3.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.6875, -4.5938, 0.1035, 2.6875, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3],
device='cuda:1') [2025-11-06 19:07:36,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 19:07:36,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.42 | bwd_microstep: 565.63 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 564.40 | step_microstep: 2.12 [2025-11-06 19:07:36,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 487.17 | bwd: 566.72 | bwd_inner: 2.15 | bwd_allreduce: 564.44 | step: 2.22 95%|█████████▍| 3331/3507 [1:22:50<03:28, 1.19s/it] {'loss': 0.2123, 'learning_rate': 1.3186363540875658e-07, 'epoch': 0.95} 95%|█████████▍| 3331/3507 [1:22:50<03:28, 1.19s/it]tensor([[-5.9062, -3.7656, 1.4375, 1.8359, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -4.2188, -0.0192, 2.4531, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6875, -4.5938, -0.3105, 1.7500, -3.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5938, -3.1094, 0.8125, -0.0498, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0312, -2.9062, 1.7578, -0.1484, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.6250, -5.2500, -1.5234, -0.0147, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9531, -3.5000, -1.0859, 3.3125, -0.3105]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:37,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.24 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.2812, -3.9844, -0.9023, 2.0156, -2.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') 
[2025-11-06 19:07:37,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:07:37,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.76 | bwd_microstep: 1.89 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.84 | step_microstep: 1.79 [2025-11-06 19:07:37,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.02 | bwd: 2.78 | bwd_inner: 1.74 | bwd_allreduce: 0.89 | step: 1.89 95%|█████████▌| 3332/3507 [1:22:51<03:02, 1.04s/it] {'loss': 0.3061, 'learning_rate': 1.3037269721430268e-07, 'epoch': 0.95} 95%|█████████▌| 3332/3507 [1:22:51<03:02, 1.04s/it]tensor([[-5.5938e+00, -3.8906e+00, -3.4485e-03, 8.2812e-01, -3.7031e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:37,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.66 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[ 0.4922, 4.0938, 3.6719, -1.2969, -1.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.1875, -2.2812, 3.2812, -0.3496, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2969, 0.3223, 1.6328, -2.1562, -3.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3750, -4.8750, -0.7266, 2.6406, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, 0.0605, 3.4844, -1.8047, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0938, -2.8281, 2.3438, 0.2930, -4.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6562, -1.4375, 2.3281, -0.1660, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 
19:07:39,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.25 | optimizer_step: 0.21 [2025-11-06 19:07:39,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.08 | bwd_microstep: 1508.31 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 1507.29 | step_microstep: 2.26 [2025-11-06 19:07:39,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 294.76 | bwd: 1509.21 | bwd_inner: 1.73 | bwd_allreduce: 1507.34 | step: 2.34 95%|█████████▌| 3333/3507 [1:22:53<03:42, 1.28s/it] {'loss': 0.3454, 'learning_rate': 1.2889018047623546e-07, 'epoch': 0.95} 95%|█████████▌| 3333/3507 [1:22:53<03:42, 1.28s/it]tensor([[-0.9688, -1.6484, -1.8828, 0.7734, 0.6055]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-6.5938, -6.5625, -2.8281, 1.3125, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3125, -0.1123, 2.2969, -0.3340, -3.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3750, 0.7227, 4.4375, -2.3594, -5.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2031, -2.7188, 0.7852, 3.4531, -1.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.2344, -0.4922, 2.7656, 1.2578, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:40,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.53 | bwd_microstep: 1.24 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 tensor([[-6.3750, -2.5938, 2.2812, -0.9453, -5.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3125, -4.4062, -0.0442, 2.5625, -2.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:07:41,124] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.22 | optimizer_step: 0.19 [2025-11-06 19:07:41,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.87 | bwd_microstep: 805.27 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 804.03 | step_microstep: 2.04 [2025-11-06 19:07:41,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 444.41 | bwd: 806.50 | bwd_inner: 2.19 | bwd_allreduce: 804.11 | step: 2.18 95%|█████████▌| 3334/3507 [1:22:54<04:13, 1.46s/it] {'loss': 0.1991, 'learning_rate': 1.2741608645954084e-07, 'epoch': 0.95} 95%|█████████▌| 3334/3507 [1:22:54<04:13, 1.46s/it]tensor([[ 0.2969, 3.4219, 4.2500, 0.3477, -0.7227]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.4062, -5.0000, 1.0938, 1.2891, -5.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5000, -2.4688, 2.8438, 1.6953, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:41,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.84 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.8125, -5.7500, -2.0156, 2.1875, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.3281, 0.4824, 2.5469, -1.6562, -3.6719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.1250, -3.8125, -1.5156, 2.9219, -0.5195]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-4.0625, -3.8750, -0.9297, 1.8906, -1.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0000, -0.4766, 1.9141, -1.8438, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:07:41,743] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 19:07:41,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.76 | bwd_microstep: 208.01 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 206.90 | step_microstep: 1.93 [2025-11-06 19:07:41,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 371.62 | bwd: 208.89 | bwd_inner: 1.83 | bwd_allreduce: 206.94 | step: 2.01 95%|█████████▌| 3335/3507 [1:22:55<03:28, 1.21s/it] {'loss': 0.9226, 'learning_rate': 1.2595041642201822e-07, 'epoch': 0.95} 95%|█████████▌| 3335/3507 [1:22:55<03:28, 1.21s/it]tensor([[-4.2188, -4.9062, -2.4688, 2.0469, -1.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8125, -4.3750, 1.0703, 3.2812, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.6562, -6.5312, -0.3770, 3.1094, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.7188, -3.3281, -0.2197, 2.9688, -1.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.8750, -5.4062, -1.8750, 1.0312, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0938, -4.2812, 0.3633, 3.3125, -2.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1562, -4.5000, 0.4531, 1.7656, -3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:44,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.47 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-7.0000, -4.1875, 1.9141, 1.1328, -5.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:44,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.58 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 19:07:44,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 338.19 | bwd_microstep: 1.98 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 0.91 | step_microstep: 2.51 [2025-11-06 19:07:44,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 516.67 | bwd: 2.92 | bwd_inner: 1.82 | bwd_allreduce: 0.95 | step: 2.60 95%|█████████▌| 3336/3507 [1:22:58<04:54, 1.72s/it] {'loss': 0.1231, 'learning_rate': 1.2449317161427942e-07, 'epoch': 0.95} 95%|█████████▌| 3336/3507 [1:22:58<04:54, 1.72s/it]tensor([[-3.6250, -0.5781, 3.1562, 0.8633, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.6250, 0.3555, 3.4688, 1.2031, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2812, -4.4375, 0.1143, 2.8125, -2.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:44,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.54 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.7812, -4.9062, -0.0957, 3.1562, -2.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.9766, 0.8828, 3.5938, 1.1094, -1.9766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.8125, -7.2188, -1.8203, 2.0000, -4.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.0312, -3.7188, 2.2188, 0.2793, -5.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0312e+00, -2.1719e+00, 1.7031e+00, -7.2479e-04, -4.1250e+00]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:07:45,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.21 | optimizer_gradients: 0.24 | optimizer_step: 0.25 [2025-11-06 19:07:45,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.32 | bwd_microstep: 570.03 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 568.73 | step_microstep: 2.37 [2025-11-06 19:07:45,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.88 | bwd: 571.00 | bwd_inner: 2.07 | bwd_allreduce: 568.78 | step: 2.45 95%|█████████▌| 3337/3507 [1:22:59<04:13, 1.49s/it] {'loss': 0.1086, 'learning_rate': 1.2304435327974873e-07, 'epoch': 0.95} 95%|█████████▌| 3337/3507 [1:22:59<04:13, 1.49s/it]tensor([[-3.5781, -4.2812, -3.1406, 0.5391, -1.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.2500, -4.2500, 1.1406, 2.1094, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2188, -3.9844, 0.5508, 2.6094, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3750, -3.1562, 1.0625, 2.8438, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9062, -3.1875, 2.0000, 1.3203, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8750, -0.4648, 2.2969, -3.1562, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.1875, -4.5000, 0.4883, 1.8516, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:46,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.55 | bwd_microstep: 1.33 | bwd_inner_microstep: 1.22 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-8.5000, -7.0938, -2.7969, -0.6016, -5.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:46,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | 
optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 19:07:46,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.13 | bwd_microstep: 1.92 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.08 [2025-11-06 19:07:46,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.71 | bwd: 3.24 | bwd_inner: 2.30 | bwd_allreduce: 0.82 | step: 2.17 95%|█████████▌| 3338/3507 [1:23:00<03:45, 1.34s/it] {'loss': 0.6008, 'learning_rate': 1.2160396265465835e-07, 'epoch': 0.95} 95%|█████████▌| 3338/3507 [1:23:00<03:45, 1.34s/it]tensor([[-3.0156, 0.5586, 2.0625, -1.9062, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2188, -4.1250, -0.4395, 3.2969, -1.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -5.2812, -1.7344, 1.9141, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6250, -0.4844, 2.2500, -0.6797, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:46,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.85 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 [19:07:46] /github/workspace/src/video/video_reader.cc:83: ERROR opening: /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch12/Big_Harp_video_Live_at_KDHX_9_5_15.mp4, No such file or directory 
Error reading /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch12/Big_Harp_video_Live_at_KDHX_9_5_15.mp4... sharegpt4v_instruct_gpt4-vision_cap100k
Traceback (most recent call last):
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 718, in __getitem__
    ret=self.video_get_item(data_item)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 604, in video_get_item
    image_list,frame_indices = self.load_video(video_path)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 582, in load_video
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared-storage-user/jiaziheng/miniconda3/envs/visualquality/lib/python3.11/site-packages/decord/video_reader.py", line 57, in __init__
    raise RuntimeError("Error reading " + uri + "...")
RuntimeError: Error reading /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch12/Big_Harp_video_Live_at_KDHX_9_5_15.mp4... 
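The traceback above comes from an unguarded `VideoReader(video_path, ...)` call in `load_video`, so one missing file on disk aborts a dataloader worker. A minimal defensive sketch, under stated assumptions: `load_video_safely` and `fallback_paths` are hypothetical names not in the training script, and `reader` stands in for any file-opening callable that raises on failure (such as the decord `VideoReader` constructor seen in the trace; decord itself is not imported here):

```python
def load_video_safely(video_path, reader, fallback_paths=()):
    """Try video_path, then each fallback; return (path, handle) on success.

    `reader` is any callable that raises when a file cannot be read,
    e.g. a decord VideoReader constructor. Returns (None, None) if
    every candidate fails, so the caller can resample another index
    instead of crashing the whole training run.
    """
    for path in (video_path, *fallback_paths):
        try:
            return path, reader(path)
        except (RuntimeError, OSError):
            # decord raises RuntimeError("Error reading <uri>...") on bad files
            continue
    return None, None
```

In a `Dataset.__getitem__`, the `(None, None)` result would typically trigger drawing a replacement sample rather than re-raising, which is what the `try`/`except` around `video_get_item` in this script appears to be for.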
tensor([[-4.0938, -4.1875, -0.9805, 2.8594, -1.4922]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4688, -4.4062, 0.5312, 3.1562, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2812, -4.2188, -0.6328, 3.1719, -1.6172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5312, -2.9219, 1.9609, -0.8906, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:07:47,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.22 | optimizer_step: 0.24 [2025-11-06 19:07:47,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.15 | bwd_microstep: 950.28 | bwd_inner_microstep: 2.33 | bwd_allreduce_microstep: 947.77 | step_microstep: 2.45 [2025-11-06 19:07:47,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 409.04 | bwd: 951.01 | bwd_inner: 3.00 | bwd_allreduce: 947.79 | step: 2.55 95%|█████████▌| 3339/3507 [1:23:01<03:47, 1.36s/it] {'loss': 0.4057, 'learning_rate': 1.2017200096805294e-07, 'epoch': 0.95} 95%|█████████▌| 3339/3507 [1:23:01<03:47, 1.36s/it]tensor([[-6.2500, -4.7500, 0.0903, 1.6641, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -3.5781, 0.5938, 2.2188, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4375, -2.9688, 1.2109, 0.8672, -3.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:48,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.88 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.19 tensor([[-6.3750, -6.1250, -2.0781, 1.4141, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4688, -2.1562, 
1.6484, 0.6328, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.9062, -3.5312, 2.3906, 0.1011, -5.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3125, -2.3125, 1.9141, 0.1494, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7656, 0.6562, 3.1250, -2.2344, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:07:49,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.63 | optimizer_gradients: 0.16 | optimizer_step: 0.15 [2025-11-06 19:07:49,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.32 | bwd_microstep: 1600.43 | bwd_inner_microstep: 1.20 | bwd_allreduce_microstep: 1599.13 | step_microstep: 2.03 [2025-11-06 19:07:49,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.21 | bwd: 1601.41 | bwd_inner: 2.06 | bwd_allreduce: 1599.18 | step: 2.23 95%|█████████▌| 3340/3507 [1:23:03<04:16, 1.54s/it] {'loss': 0.3772, 'learning_rate': 1.1874846944177732e-07, 'epoch': 0.95} 95%|█████████▌| 3340/3507 [1:23:03<04:16, 1.54s/it]tensor([[-6.3125, -4.2500, 0.8438, 1.5312, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:50,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.14 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.3438, -4.9375, -0.2891, 1.2500, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.3125, -6.3750, -1.8906, 1.2969, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2188, -1.4297, 1.5625, -0.5195, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8750, -2.1250, 3.1094, -0.2432, 
-5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-10.0000, -6.3750, 0.3340, -1.3984, -7.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.5391, 2.0312, 1.6562, -3.0312, -2.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.9688, -2.3906, 1.5859, 0.4883, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:07:51,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 19:07:51,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 296.04 | bwd_microstep: 1202.93 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 1201.75 | step_microstep: 1.88 [2025-11-06 19:07:51,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 477.21 | bwd: 1203.94 | bwd_inner: 2.01 | bwd_allreduce: 1201.78 | step: 1.95 95%|█████████▌| 3341/3507 [1:23:05<04:24, 1.60s/it] {'loss': 0.5391, 'learning_rate': 1.1733336929049322e-07, 'epoch': 0.95} 95%|█████████▌| 3341/3507 [1:23:05<04:24, 1.60s/it]tensor([[-2.9844, 0.7383, 1.7344, -2.8906, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.6094, -4.6875, -2.5000, 2.8281, -0.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:51,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.21 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.7969, -3.4844, -0.6328, 2.4531, -1.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -4.0625, 0.4609, 2.1875, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.2500, -1.1328, 3.6094, -0.4238, -5.0000]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3438, -4.7500, -0.3770, 2.7031, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.3750, -3.2656, 1.2578, 1.6484, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4375, -1.6562, 3.1875, -0.0938, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:07:52,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 19:07:52,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.27 | bwd_microstep: 101.61 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 100.46 | step_microstep: 1.81 [2025-11-06 19:07:52,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.51 | bwd: 102.44 | bwd_inner: 1.81 | bwd_allreduce: 100.49 | step: 1.89 95%|█████████▌| 3342/3507 [1:23:05<03:28, 1.26s/it] {'loss': 0.4972, 'learning_rate': 1.1592670172166032e-07, 'epoch': 0.95} 95%|█████████▌| 3342/3507 [1:23:05<03:28, 1.26s/it]tensor([[-5.4062, -3.0625, 0.9883, 0.2949, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.2188, -5.5000, -1.4375, 1.2969, -3.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:52,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.91 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-5.6250, -4.1250, 0.4648, 1.8047, -3.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5625, -3.0781, 2.0625, -0.3535, -5.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -0.7227, 3.7500, -0.2207, -4.6250]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4219, 0.6680, 3.5312, -1.3906, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2188, -3.0938, -1.1328, 1.6719, -1.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5000, -2.7031, 0.9688, 1.3594, -2.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:07:55,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 0.24 | optimizer_step: 0.34 [2025-11-06 19:07:55,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.75 | bwd_microstep: 2307.09 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 2305.77 | step_microstep: 2.52 [2025-11-06 19:07:55,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 326.68 | bwd: 2308.06 | bwd_inner: 2.03 | bwd_allreduce: 2305.83 | step: 2.64 95%|█████████▌| 3343/3507 [1:23:09<05:00, 1.83s/it] {'loss': 0.3332, 'learning_rate': 1.1452846793554739e-07, 'epoch': 0.95} 95%|█████████▌| 3343/3507 [1:23:09<05:00, 1.83s/it]tensor([[-6.2812, -5.5938, -1.2266, 1.5703, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5000, -2.5156, 2.7812, -0.8125, -5.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4062, -5.3750, -0.5430, 1.7656, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:55,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.65 | bwd_microstep: 2.40 | bwd_inner_microstep: 2.28 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.6562, -1.2891, 3.8750, -0.7109, -5.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7656, 0.1074, 2.3594, -2.3594, -4.1875]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([2], device='cuda:0') tensor([[-5.3125, -2.7656, 1.5625, 0.3242, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6562, -3.0938, 1.5781, 0.7617, -4.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7031, -3.0938, 1.1250, 4.2812, -1.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:07:55,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 19:07:55,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.94 | bwd_microstep: 189.85 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 188.83 | step_microstep: 2.13 [2025-11-06 19:07:55,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 330.61 | bwd: 192.25 | bwd_inner: 3.21 | bwd_allreduce: 188.88 | step: 2.21 95%|█████████▌| 3344/3507 [1:23:09<03:57, 1.46s/it] {'loss': 0.1267, 'learning_rate': 1.1313866912522343e-07, 'epoch': 0.95} 95%|█████████▌| 3344/3507 [1:23:09<03:57, 1.46s/it]tensor([[-5.5938, -1.7266, 2.9844, -0.7500, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.4375, -3.8438, 2.2031, 2.1406, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6250, -0.3301, 3.9688, -0.6484, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.6289, 2.8438, 2.4375, -2.4219, -1.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:07:56,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.84 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9375, -3.2969, -0.1855, 0.6211, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:3') tensor([[-3.7656, -4.0625, -1.1016, 3.0312, -1.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.6875, -3.2969, -1.7891, 1.8594, -0.4004]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-3.5469, -4.2812, -1.9297, 2.6406, -0.7891]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:07:57,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.19 | optimizer_step: 0.22 [2025-11-06 19:07:57,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.27 | bwd_microstep: 1067.66 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 1066.64 | step_microstep: 2.05 [2025-11-06 19:07:57,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 335.13 | bwd: 1068.31 | bwd_inner: 1.48 | bwd_allreduce: 1066.68 | step: 2.13 95%|█████████▌| 3345/3507 [1:23:11<04:12, 1.56s/it] {'loss': 0.5742, 'learning_rate': 1.1175730647656313e-07, 'epoch': 0.95} 95%|█████████▌| 3345/3507 [1:23:11<04:12, 1.56s/it]tensor([[-4.6875, -4.2812, -0.6250, 2.5469, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.5156, -0.2031, 1.0781, 0.0549, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:57,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.05 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.1250, -1.0781, 2.9531, -1.6406, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2188, -0.2451, 2.8281, -1.7422, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4062, -7.2500, -4.5000, 0.7148, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') 
tensor([[-4.6562, -1.8750, 2.2188, 0.9102, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0938, -2.9062, 1.1562, 0.6602, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.8125, -5.9062, 0.5000, 2.0312, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:07:58,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:07:58,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.12 | bwd_microstep: 139.31 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 138.16 | step_microstep: 1.69 [2025-11-06 19:07:58,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.21 | bwd: 140.14 | bwd_inner: 1.83 | bwd_allreduce: 138.19 | step: 1.77 95%|█████████▌| 3346/3507 [1:23:12<03:20, 1.25s/it] {'loss': 0.3348, 'learning_rate': 1.1038438116824258e-07, 'epoch': 0.95} 95%|█████████▌| 3346/3507 [1:23:12<03:20, 1.25s/it]tensor([[-4.6875, -1.6719, 1.8203, -0.1240, -3.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.7500, -4.5000, -0.6797, 3.0781, -2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:07:58,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.48 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-2.7031, -3.3125, -1.4922, 2.6250, -0.2539]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.5938, -1.7969, 1.2109, -0.1777, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.4688, -3.8906, 0.9531, -0.1621, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9688, -5.1562, 
-1.3984, 3.2500, -1.8516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.8750, -4.7812, 0.2412, 2.6562, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-1.5078, 2.0156, 2.8281, -0.9805, -2.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
[2025-11-06 19:08:00,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.15 | optimizer_step: 0.16
[2025-11-06 19:08:00,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.37 | bwd_microstep: 169.47 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 168.43 | step_microstep: 2.08
[2025-11-06 19:08:00,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.88 | bwd: 170.33 | bwd_inner: 1.74 | bwd_allreduce: 168.47 | step: 2.17
95%|█████████▌| 3347/3507 [1:23:14<04:03, 1.52s/it] {'loss': 0.6279, 'learning_rate': 1.0901989437173577e-07, 'epoch': 0.95}
tensor([[-6.1875, -3.9375, 1.6016, 1.9453, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.6250, -4.9062, 0.3770, 2.1094, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0938, -2.5938, 1.5312, 0.6172, -3.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.2500, 1.4453, 4.0938, 2.3594, -1.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:08:00,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.46 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.6562, -4.6562, -0.1113, 2.4062, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.1875, -4.3438, -0.9609, 1.1406, -2.9219]],
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-1.0234, 3.1719, 3.2656, -2.8594, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-1.8828, 2.0312, 3.0000, -2.3594, -3.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
[2025-11-06 19:08:00,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.14 | optimizer_step: 0.15
[2025-11-06 19:08:00,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 150.73 | bwd_microstep: 13.19 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 12.27 | step_microstep: 1.67
[2025-11-06 19:08:00,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 352.22 | bwd: 14.05 | bwd_inner: 1.63 | bwd_allreduce: 12.31 | step: 1.75
95%|█████████▌| 3348/3507 [1:23:14<03:08, 1.19s/it] {'loss': 0.617, 'learning_rate': 1.0766384725131807e-07, 'epoch': 0.95}
tensor([[-1.6250, -2.6875, -1.8750, 2.3906, 0.6523]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0')
[2025-11-06 19:08:00,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.14 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.7969, -3.4062, 0.0359, 2.9844, -1.5234]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-3.8125, -1.6484, 2.1875, 1.9609, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.6094, 0.1621, 2.8125, -1.0391, -3.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.9688, -0.8047, 2.9062, -1.6562, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.8438, -1.7500, 2.2344, 0.2275, -4.0312]], device='cuda:2',
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.5938, -0.1895, 2.0156, -1.1797, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-5.3750, -4.5312, -0.4043, 2.1094, -2.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:08:03,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.44 | optimizer_step: 0.42
[2025-11-06 19:08:03,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.94 | bwd_microstep: 2488.31 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 2487.16 | step_microstep: 3.90
[2025-11-06 19:08:03,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 497.10 | bwd: 2489.23 | bwd_inner: 1.81 | bwd_allreduce: 2487.24 | step: 3.99
95%|█████████▌| 3349/3507 [1:23:17<04:35, 1.74s/it] {'loss': 0.4161, 'learning_rate': 1.0631624096406612e-07, 'epoch': 0.95}
tensor([[-5.8750, -1.9297, 2.6250, -1.0703, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:08:03,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 134.69 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.1562, -2.7031, 1.9609, 2.0000, -3.5469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.1250, 1.1484, 3.0000, 0.1631, -2.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0625, -5.0312, -1.2812, 2.4844, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-1.8359, 2.0469, 3.1094, -2.2344, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.4688, 0.9062, 3.8438, -1.8906, -4.2812]], device='cuda:2', dtype=torch.bfloat16,
grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.9062, -3.0312, 1.3906, 2.0938, -3.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.2188, -3.4844, 2.6562, -0.0659, -6.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
[2025-11-06 19:08:04,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.15 | optimizer_step: 0.15
[2025-11-06 19:08:04,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.27 | bwd_microstep: 258.09 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 256.96 | step_microstep: 1.47
[2025-11-06 19:08:04,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 273.95 | bwd: 259.01 | bwd_inner: 1.85 | bwd_allreduce: 257.01 | step: 1.56
96%|█████████▌| 3350/3507 [1:23:18<03:38, 1.39s/it] {'loss': 0.568, 'learning_rate': 1.0497707665985235e-07, 'epoch': 0.96}
tensor([[-5.4062, -2.2344, 2.4375, 0.2275, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-6.4688, -5.7188, -1.1328, 1.8672, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.4688, -3.9062, 0.4609, 1.7109, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.4062, -4.8125, 1.3047, 1.2031, -5.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:08:04,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.52 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10
tensor([[-5.8438, -5.3438, -1.4453, 1.7578, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.0312, -3.5000, 2.3906, -0.0188, -5.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2],
device='cuda:2')
tensor([[-6.2812, -3.0938, 1.6172, -0.1387, -5.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.9375, -4.2812, -1.0078, 3.5156, -1.1016]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:08:06,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.17 | optimizer_step: 0.20
[2025-11-06 19:08:06,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 88.18 | bwd_microstep: 1348.18 | bwd_inner_microstep: 3.68 | bwd_allreduce_microstep: 1344.40 | step_microstep: 2.01
[2025-11-06 19:08:06,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 275.73 | bwd: 1349.01 | bwd_inner: 4.41 | bwd_allreduce: 1344.45 | step: 2.10
96%|█████████▌| 3351/3507 [1:23:19<03:49, 1.47s/it] {'loss': 0.5617, 'learning_rate': 1.036463554813416e-07, 'epoch': 0.96}
tensor([[-3.8125, -0.4883, 1.7188, -1.9531, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-9.0000, -7.4375, -1.7812, 0.2734, -5.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-6.5625, -7.2188, -4.2812, 0.5742, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-8.6250, -6.7812, -1.3594, 0.0371, -5.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:08:06,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.40 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.2188, -3.6250, 0.1807, 3.3750, -1.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-2.1406, 1.1562, 1.0781, -2.6094, -2.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1')
tensor([[-5.0000, -3.7969, 0.5039, 2.5469, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-7.0938, -4.9375, 0.1128, 0.5273, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 19:08:06,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.69 | optimizer_gradients: 0.17 | optimizer_step: 0.19
[2025-11-06 19:08:06,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.32 | bwd_microstep: 2.72 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 1.65 | step_microstep: 2.53
[2025-11-06 19:08:06,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.75 | bwd: 3.54 | bwd_inner: 1.68 | bwd_allreduce: 1.70 | step: 2.61
96%|█████████▌| 3352/3507 [1:23:20<03:00, 1.16s/it] {'loss': 0.6874, 'learning_rate': 1.0232407856400007e-07, 'epoch': 0.96}
tensor([[-4.0312, -4.7812, -2.3438, 2.3594, -1.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.6406, 1.3203, 3.9375, -0.9648, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:08:06,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.40 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-3.4531, -0.7422, 1.8281, 0.0703, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.1875, -4.9062, -0.6211, 3.0469, -2.3594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-7.3750, -5.0000, -0.1011, -0.1465, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.6875, -5.0625, -1.7734, 2.7344, -1.6953]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3125, -1.3984, 2.5000,
0.7578, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.7812, -3.1562, 2.4531, 1.8203, -4.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:08:09,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.70 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 19:08:09,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.04 | bwd_microstep: 2630.34 | bwd_inner_microstep: 1.31 | bwd_allreduce_microstep: 2628.93 | step_microstep: 3.02
[2025-11-06 19:08:09,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.46 | bwd: 2631.27 | bwd_inner: 2.15 | bwd_allreduce: 2628.98 | step: 3.10
96%|█████████▌| 3353/3507 [1:23:23<04:23, 1.71s/it] {'loss': 0.2118, 'learning_rate': 1.0101024703608741e-07, 'epoch': 0.96}
tensor([[-5.2812, -2.6406, 1.5938, 0.2012, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:08:09,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.67 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.63 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-4.5625, -4.7188, -1.1250, 3.1250, -1.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2812, -4.2188, -1.1094, 0.4277, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.8750, -4.7500, -0.4961, 1.8516, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.0312, -0.6055, 3.8281, -1.1797, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.3750, -3.8438, 0.3086, 1.1875, -3.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.7500, -3.1406, 1.2422, 2.2969, -2.9062]],
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.6562, -1.3438, 1.5000, 0.2715, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
[2025-11-06 19:08:09,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 19:08:09,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.63 | bwd_microstep: 16.75 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 15.89 | step_microstep: 1.59
[2025-11-06 19:08:09,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 351.33 | bwd: 17.49 | bwd_inner: 1.41 | bwd_allreduce: 15.93 | step: 1.67
96%|█████████▌| 3354/3507 [1:23:23<03:21, 1.32s/it] {'loss': 0.3295, 'learning_rate': 9.970486201865693e-08, 'epoch': 0.96}
tensor([[-4.6562, -2.4688, 0.4590, -0.6250, -3.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-3.8750, -1.0391, 2.6875, 1.3828, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:08:10,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.69 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[ 0.9375, 4.5000, 5.0625, 0.4844, -0.4902]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.9688, -2.2188, 2.0156, 0.7070, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.1875, -3.7344, 1.3047, 1.1406, -4.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.8750, -0.3867, 2.9062, 2.0156, -2.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-4.8750, -3.1250, 0.5391, 1.1250, -3.1875]], device='cuda:3',
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.5312, -2.7812, 1.2734, 0.0535, -4.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:08:11,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.32 | optimizer_step: 0.34
[2025-11-06 19:08:11,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.16 | bwd_microstep: 1399.15 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 1398.16 | step_microstep: 3.05
[2025-11-06 19:08:11,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.87 | bwd: 1399.91 | bwd_inner: 1.52 | bwd_allreduce: 1398.23 | step: 3.12
96%|█████████▌| 3355/3507 [1:23:25<03:40, 1.45s/it] {'loss': 0.7405, 'learning_rate': 9.840792462555426e-08, 'epoch': 0.96}
tensor([[-2.9688, -0.5508, 3.1406, 1.8125, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.4375, -1.4453, 3.5000, -0.5664, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-7.3438, -4.4375, 1.2969, 0.4629, -5.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-3.7969, -2.2344, 1.6328, 2.6094, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:08:12,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.74 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10
tensor([[-6.6875, -3.9062, 1.1406, 0.1318, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.1875, -3.4531, 0.2246, 0.5352, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3125, -4.7812, -1.3438, 3.4375, -1.3047]], device='cuda:2', dtype=torch.bfloat16,
grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.6875, -3.0000, 1.0391, 1.9141, -2.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:08:12,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.52 | optimizer_gradients: 0.17 | optimizer_step: 0.16
[2025-11-06 19:08:12,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.73 | bwd_microstep: 1.62 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.77 | step_microstep: 2.14
[2025-11-06 19:08:12,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 399.48 | bwd: 2.60 | bwd_inner: 1.62 | bwd_allreduce: 0.82 | step: 2.24
96%|█████████▌| 3356/3507 [1:23:26<03:28, 1.38s/it] {'loss': 0.4235, 'learning_rate': 9.711943596341644e-08, 'epoch': 0.96}
tensor([[-4.9375, -5.0312, -1.1797, 2.9531, -1.9922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7812, -4.7500, -2.3750, 0.7734, -2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-7.5625, -6.0625, -1.1484, 0.5273, -4.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:08:13,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.65 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-6.6250, -3.1562, 2.7500, 0.6172, -5.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.5156, -3.7656, -0.9648, 2.7812, -1.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.8438, -5.7188, -1.8984, 1.8281, -2.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.3438, -3.7031, 0.7109, 4.0000, -1.8203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3],
device='cuda:0')
tensor([[-6.0938, -4.0938, 1.6875, 2.5156, -3.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:08:16,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.34 | optimizer_step: 0.42
[2025-11-06 19:08:16,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.85 | bwd_microstep: 2793.03 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 2792.05 | step_microstep: 54.11
[2025-11-06 19:08:16,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 429.52 | bwd: 2793.72 | bwd_inner: 1.42 | bwd_allreduce: 2792.13 | step: 54.19
96%|█████████▌| 3357/3507 [1:23:30<05:08, 2.06s/it] {'loss': 0.4252, 'learning_rate': 9.583939713167179e-08, 'epoch': 0.96}
tensor([[-3.1406, -4.1875, -2.3125, 2.9062, -0.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5625, -3.7344, 0.2598, 2.5781, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.6250, -1.7578, 2.3750, 0.8047, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.7812, -4.5938, 0.0938, 0.2559, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:08:16,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 216.12 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-6.5312, -4.5625, 0.5547, 1.2891, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-4.3750, -0.9922, 2.5625, -0.1953, -3.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.8438, -3.1562, 0.9844, 2.0625, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-7.0312, -4.1875, 2.0312, 1.2734, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:08:17,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.16 | optimizer_step: 0.18
[2025-11-06 19:08:17,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 325.58 | bwd_microstep: 2.10 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.91 | step_microstep: 2.31
[2025-11-06 19:08:17,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 541.72 | bwd: 2.92 | bwd_inner: 1.81 | bwd_allreduce: 0.95 | step: 2.39
96%|█████████▌| 3358/3507 [1:23:30<04:01, 1.62s/it] {'loss': 0.3624, 'learning_rate': 9.456780922253995e-08, 'epoch': 0.96}
tensor([[-8.9375, -6.2812, -1.3125, -1.8203, -6.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-4.3125, -3.9844, -0.3320, 3.0312, -1.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:08:17,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.98 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-4.2812, -3.1875, 0.5234, 2.2656, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.0391, 1.2109, 1.5234, -0.2773, -1.1797]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-6.7500, -5.3125, -0.2773, 1.4062, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-3.1094, -2.1094, 2.1875, 4.5938, -1.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.4688, -2.8438, 0.7461, 1.4922, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.8750, -0.9922, 3.0156,
-1.0781, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
[2025-11-06 19:08:19,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.45 | optimizer_gradients: 0.20 | optimizer_step: 0.23
[2025-11-06 19:08:19,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.98 | bwd_microstep: 2266.03 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 2264.87 | step_microstep: 2.51
[2025-11-06 19:08:19,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.98 | bwd: 2266.91 | bwd_inner: 1.84 | bwd_allreduce: 2264.92 | step: 2.59
96%|█████████▌| 3359/3507 [1:23:33<04:44, 1.92s/it] {'loss': 0.9345, 'learning_rate': 9.330467332102855e-08, 'epoch': 0.96}
tensor([[-5.6250, -2.7500, 1.5547, -0.0437, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2')
tensor([[-2.9688, 0.2373, 1.5234, -1.5703, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-8.5000, -6.7188, -0.2490, 1.5078, -5.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.5156, -1.5156, 1.1406, 2.6875, -1.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.2812, -2.8750, 1.0391, 2.4688, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.9062, -0.6523, 2.7656, -2.2031, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-6.7188, -6.2812, -1.8750, 1.8672, -3.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:08:21,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.04 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-3.4062, -3.5625, -0.9727, 2.6875, -0.9961]],
device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:08:21,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.18 | optimizer_step: 0.18
[2025-11-06 19:08:21,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.91 | bwd_microstep: 1.94 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.07
[2025-11-06 19:08:21,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.96 | bwd: 3.02 | bwd_inner: 1.94 | bwd_allreduce: 0.91 | step: 2.17
96%|█████████▌| 3360/3507 [1:23:35<04:38, 1.89s/it] {'loss': 0.1537, 'learning_rate': 9.204999050493213e-08, 'epoch': 0.96}
tensor([[-4.9688, -2.5156, 1.0781, 0.0167, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:08:21,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.23 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09
tensor([[-5.3125, -4.0312, -0.8281, 0.2930, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-2.5156, 1.2266, 3.2344, -0.5898, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.9688, -2.1250, 2.5938, -0.8594, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.2812, -4.0938, -2.7969, 1.4844, -0.6289]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-1.0781, 2.0156, 2.2500, -1.1328, -1.7266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-5.1250, -5.1250, -1.4141, 2.8125, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.6875, -2.2812, 2.9844, 0.6367, -4.7188]], device='cuda:2',
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
[2025-11-06 19:08:22,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 19:08:22,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 199.54 | bwd_microstep: 74.97 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 74.09 | step_microstep: 2.25
[2025-11-06 19:08:22,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 374.81 | bwd: 75.75 | bwd_inner: 1.44 | bwd_allreduce: 74.14 | step: 2.34
96%|█████████▌| 3361/3507 [1:23:35<03:34, 1.47s/it] {'loss': 0.5072, 'learning_rate': 9.080376184483653e-08, 'epoch': 0.96}
tensor([[-4.1562, 0.4805, 4.0000, -1.8203, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.0312, -3.7656, 0.5625, 2.3906, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.3125, -1.6797, 2.0000, 0.5469, -3.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-5.4062, -2.8906, 2.0312, 1.5469, -3.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-5.0938, -3.2344, 0.5898, 1.4297, -3.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:08:22,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.41 | bwd_microstep: 0.71 | bwd_inner_microstep: 0.60 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-7.9375, -6.2188, -0.3340, 1.4688, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
tensor([[-2.4531, -3.4531, -2.2969, 2.1875, 0.0422]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-4.5312, -4.3438, -0.8164, 2.7031, -1.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)
tensor([3], device='cuda:2')
[2025-11-06 19:08:23,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.19 | optimizer_step: 0.18
[2025-11-06 19:08:23,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.33 | bwd_microstep: 844.43 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 843.57 | step_microstep: 2.43
[2025-11-06 19:08:23,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.77 | bwd: 845.14 | bwd_inner: 1.38 | bwd_allreduce: 843.62 | step: 2.51
96%|█████████▌| 3362/3507 [1:23:37<03:42, 1.53s/it] {'loss': 0.4435, 'learning_rate': 8.95659884041089e-08, 'epoch': 0.96}
tensor([[-1.5469, 1.8594, 1.8281, -2.2500, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:08:23,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 105.70 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08
tensor([[-5.5000, -3.2969, 1.0312, 1.2188, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.0156, -0.9414, 3.0938, 3.2031, -1.8516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
tensor([[-4.8125, -4.7188, -1.0625, 2.7500, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.3125, 0.2363, 2.8750, -0.3574, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-1.5469, 2.6094, 3.1250, -2.7812, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-3.7031, -0.4141, 1.6094, -1.0469, -3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3')
tensor([[-4.2500, -3.8750, -0.5312, 2.1719, -2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
[2025-11-06 19:08:25,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.16 | optimizer_step: 0.17
[2025-11-06 19:08:25,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.18 | bwd_microstep: 1251.61 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 1250.71 | step_microstep: 2.14
[2025-11-06 19:08:25,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 280.90 | bwd: 1252.29 | bwd_inner: 1.40 | bwd_allreduce: 1250.75 | step: 2.22
96%|█████████▌| 3363/3507 [1:23:39<03:42, 1.55s/it] {'loss': 0.6293, 'learning_rate': 8.833667123890444e-08, 'epoch': 0.96}
tensor([[-6.7500, -4.6250, 0.8984, 1.8047, -4.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3')
tensor([[-4.7188, -3.5312, 1.0156, 3.0312, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-7.2188, -4.7188, 1.6406, 1.8672, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-3.8438, -2.5312, 1.6016, 3.0156, -2.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-5.5938, -5.4062, -1.7656, 2.0156, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0')
[2025-11-06 19:08:26,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.01 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08
tensor([[-5.8125, -4.7188, 0.2598, 2.5312, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-3.7500, -4.5312, -2.0469, 2.6250, -0.8672]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.4062, -1.7031, 2.7344, -0.8789, -5.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:08:27,046]
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.68 | optimizer_gradients: 0.16 | optimizer_step: 0.26 [2025-11-06 19:08:27,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.18 | bwd_microstep: 2.27 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1.05 | step_microstep: 2.91 [2025-11-06 19:08:27,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 468.23 | bwd: 3.21 | bwd_inner: 1.96 | bwd_allreduce: 1.09 | step: 2.99 96%|█████████▌| 3364/3507 [1:23:40<03:49, 1.61s/it] {'loss': 0.6438, 'learning_rate': 8.711581139816294e-08, 'epoch': 0.96} 96%|█████████▌| 3364/3507 [1:23:40<03:49, 1.61s/it]tensor([[-3.6562, -4.1875, -1.6719, 2.5938, -0.9766]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8750, -0.2461, 4.0312, -1.3281, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:08:27,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.54 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-1.7734, 1.2734, 1.7266, -1.5156, -2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4844, 0.4102, 3.8594, -0.4531, -3.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.7109, -2.3125, -1.6719, 1.3359, 0.1553]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-4.7812, -4.3125, -0.6172, 2.4375, -2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.9062, -4.0625, 1.9688, 1.2734, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2500, -2.1719, 2.5469, 1.0391, -4.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:08:28,694] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 19:08:28,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.16 | bwd_microstep: 762.41 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 761.32 | step_microstep: 1.84 [2025-11-06 19:08:28,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 452.69 | bwd: 763.39 | bwd_inner: 1.90 | bwd_allreduce: 761.36 | step: 1.91 96%|█████████▌| 3365/3507 [1:23:42<03:49, 1.62s/it] {'loss': 0.8262, 'learning_rate': 8.59034099236078e-08, 'epoch': 0.96} 96%|█████████▌| 3365/3507 [1:23:42<03:49, 1.62s/it]tensor([[-4.0938, -4.8750, -3.2031, 1.1719, -1.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4062, -3.4062, 0.2061, 1.7031, -2.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-1.7422, 1.5547, 2.3906, -1.9297, -2.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0312, -2.9688, 3.0312, -0.4414, -6.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8750, -1.7812, 1.8594, -0.3066, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:08:29,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.83 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12 tensor([[-6.4688, -5.6562, -1.3594, 1.3672, -3.7344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.3750, -1.9297, 3.5781, -1.1953, -6.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6562, -5.0312, -2.2500, 2.1719, -1.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:08:31,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.65 | optimizer_gradients: 0.20 | optimizer_step: 0.22 [2025-11-06 19:08:31,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.74 | bwd_microstep: 1322.10 | bwd_inner_microstep: 1.40 | bwd_allreduce_microstep: 1320.59 | step_microstep: 2.67 [2025-11-06 19:08:31,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 396.61 | bwd: 1323.03 | bwd_inner: 2.21 | bwd_allreduce: 1320.66 | step: 2.79 96%|█████████▌| 3366/3507 [1:23:45<04:37, 1.97s/it] {'loss': 0.124, 'learning_rate': 8.469946784974481e-08, 'epoch': 0.96} 96%|█████████▌| 3366/3507 [1:23:45<04:37, 1.97s/it]tensor([[-7.3125, -5.4688, 0.2432, 1.4531, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4688, -3.4219, 1.2812, -0.4727, -5.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0000, -1.4531, 2.6406, -0.4805, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:08:31,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.32 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.68 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.1250, -0.9766, 1.4453, -1.6172, -3.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.4375, -4.3438, -0.9766, 2.8281, -1.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5625, -3.7188, 0.1680, 2.6406, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4375, -4.5938, -1.6172, 2.3281, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5938, -0.2070, 3.9219, -1.1953, -4.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:08:31,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.54 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:08:31,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.56 | bwd_microstep: 36.06 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 34.99 | step_microstep: 2.00 [2025-11-06 19:08:31,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 353.92 | bwd: 36.82 | bwd_inner: 1.68 | bwd_allreduce: 35.02 | step: 2.07 96%|█████████▌| 3367/3507 [1:23:45<03:30, 1.50s/it] {'loss': 0.253, 'learning_rate': 8.350398620386113e-08, 'epoch': 0.96} 96%|█████████▌| 3367/3507 [1:23:45<03:30, 1.50s/it]tensor([[-6.0000, -3.6875, 0.6523, 0.3398, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1562e+00, -4.0625e+00, -4.9744e-03, 2.2656e+00, -2.8438e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7500, -4.5312, 0.6211, 2.7812, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:08:32,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.60 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0625, -1.0312, 3.1719, -0.6172, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-6.7188, -3.7344, 1.8125, 0.6875, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([3], device='cuda:1') tensor([[-4.1562, -1.5781, 0.7461, -1.1016, -3.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.1797, 2.4219, 3.6562, -0.6836, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-2.9375, 0.7617, 3.1406, -0.6250, -3.2031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:08:33,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | 
optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 19:08:33,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.72 | bwd_microstep: 836.53 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 835.32 | step_microstep: 1.59 [2025-11-06 19:08:33,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 328.34 | bwd: 837.34 | bwd_inner: 1.83 | bwd_allreduce: 835.36 | step: 1.67 96%|█████████▌| 3368/3507 [1:23:46<03:16, 1.41s/it] {'loss': 0.5502, 'learning_rate': 8.231696600602523e-08, 'epoch': 0.96} 96%|█████████▌| 3368/3507 [1:23:46<03:16, 1.41s/it]tensor([[-7.0312, -3.4531, 1.5859, -1.1172, -5.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.9531, 1.3047, 1.4219, -2.3750, -2.5781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0000, -1.5312, 1.7344, 0.5508, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:08:33,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.67 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.2812, -5.3125, -1.8750, 2.1719, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9844, -1.1797, 2.5781, 3.0781, -1.6641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2031, -2.8750, -0.6562, 3.8594, 0.2559]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9844, 1.8203, 2.4219, -2.6719, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0625, -2.9688, 2.6250, 1.2422, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:08:34,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.23 | 
optimizer_step: 0.20 [2025-11-06 19:08:34,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 82.76 | bwd_microstep: 2.34 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1.05 | step_microstep: 2.29 [2025-11-06 19:08:34,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 238.42 | bwd: 3.33 | bwd_inner: 2.10 | bwd_allreduce: 1.08 | step: 2.37 96%|█████████▌| 3369/3507 [1:23:48<03:05, 1.34s/it] {'loss': 0.3486, 'learning_rate': 8.113840826908582e-08, 'epoch': 0.96} 96%|█████████▌| 3369/3507 [1:23:48<03:05, 1.34s/it]tensor([[-4.7188, -4.2812, 0.1206, 3.7031, -1.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:08:34,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.41 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-5.5312, -1.0547, 3.6094, -1.6172, -5.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9375, -2.9062, 1.0625, 1.4375, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5625, -4.7188, -1.4453, 2.8906, -1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.3438, -1.1719, 1.5078, -0.7070, -3.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2500, -5.6562, -1.2422, 1.8594, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9062, -4.4688, -0.5781, 2.6250, -2.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5625, -3.3750, 0.5078, 0.5000, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:08:35,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 
19:08:35,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.24 | bwd_microstep: 568.76 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 567.52 | step_microstep: 1.94 [2025-11-06 19:08:35,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.68 | bwd: 569.55 | bwd_inner: 1.82 | bwd_allreduce: 567.58 | step: 2.05 96%|█████████▌| 3370/3507 [1:23:49<02:48, 1.23s/it] {'loss': 0.1985, 'learning_rate': 7.996831399867067e-08, 'epoch': 0.96} 96%|█████████▌| 3370/3507 [1:23:49<02:48, 1.23s/it]tensor([[-3.4688, 1.0156, 3.8594, -1.8750, -4.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3125, -4.3438, -1.2344, 2.6094, -1.6172]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.3125, -3.8125, 0.1895, 3.5000, -1.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:08:35,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.97 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.4062, -1.4766, 2.9219, -1.3359, -5.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1875, -1.7500, 2.9688, -1.7422, -5.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.7812, -2.7188, 1.1406, 1.2188, -3.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.2500, -3.0312, 2.7500, 1.0156, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1875, 0.2812, 2.8125, -0.9219, -3.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:08:36,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 19:08:36,122] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.64 | bwd_microstep: 523.36 | bwd_inner_microstep: 1.40 | bwd_allreduce_microstep: 521.86 | step_microstep: 1.97 [2025-11-06 19:08:36,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.63 | bwd: 524.26 | bwd_inner: 2.19 | bwd_allreduce: 521.92 | step: 2.06 96%|█████████▌| 3371/3507 [1:23:49<02:32, 1.12s/it] {'loss': 0.1503, 'learning_rate': 7.88066841931856e-08, 'epoch': 0.96} 96%|█████████▌| 3371/3507 [1:23:49<02:32, 1.12s/it]tensor([[-3.7656, -4.2812, -1.4375, 3.2031, -0.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6562, -3.6719, 0.6992, 3.0938, -2.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0938, -3.3906, 0.5234, 3.5156, -1.7422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.9375, -6.4062, -0.0088, 2.6250, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1875, -0.4688, 3.5312, -1.9922, -5.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:08:36,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 103.44 | bwd_microstep: 1.18 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-1.3906, 2.0469, 2.0156, -2.0938, -2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-6.5938, -4.5312, 0.1836, 0.4902, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.9375, -4.1875, 1.5078, 0.7227, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:08:39,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.42 | optimizer_gradients: 0.21 | optimizer_step: 0.31 [2025-11-06 19:08:39,201] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 187.56 | bwd_microstep: 2050.96 | bwd_inner_microstep: 1.23 | bwd_allreduce_microstep: 2049.64 | step_microstep: 2.44 [2025-11-06 19:08:39,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 291.00 | bwd: 2052.14 | bwd_inner: 2.28 | bwd_allreduce: 2049.70 | step: 2.53 96%|█████████▌| 3372/3507 [1:23:53<03:50, 1.71s/it] {'loss': 0.3757, 'learning_rate': 7.765351984381663e-08, 'epoch': 0.96} 96%|█████████▌| 3372/3507 [1:23:53<03:50, 1.71s/it]tensor([[-6.5938, -5.0625, 0.3594, 2.2656, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:08:39,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.77 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.92 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-6.7500, -4.5312, 1.3828, 1.9844, -4.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.7812, -4.4688, -0.3516, 1.1875, -3.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.6250, -5.1562, 0.5859, 2.6562, -3.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8438, -2.7969, 1.4141, 1.3750, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.6562, -3.4219, 2.1875, 0.3652, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2656, 0.2695, 2.7969, -1.0312, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-0.6445, 3.0000, 2.6875, -2.1562, -1.8984]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:08:39,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.22 | optimizer_step: 0.21 [2025-11-06 19:08:39,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 181.55 | bwd_microstep: 151.13 | bwd_inner_microstep: 1.84 | bwd_allreduce_microstep: 149.18 | step_microstep: 2.10 [2025-11-06 19:08:39,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 308.33 | bwd: 152.17 | bwd_inner: 2.78 | bwd_allreduce: 149.23 | step: 2.18 96%|█████████▌| 3373/3507 [1:23:53<03:00, 1.35s/it] {'loss': 0.4162, 'learning_rate': 7.650882193452114e-08, 'epoch': 0.96} 96%|█████████▌| 3373/3507 [1:23:53<03:00, 1.35s/it]tensor([[-4.8125, -4.0625, -0.5664, 1.8281, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:08:39,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.93 | bwd_microstep: 7.11 | bwd_inner_microstep: 6.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.0938, -0.7188, 3.2969, -1.7500, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1562, -4.9062, -0.8789, 2.9531, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.4844, 1.0391, 3.1562, -1.3984, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.7812, -4.0000, 0.4727, 1.3984, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.3438, -4.7812, 1.4922, 1.6172, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.6094, 0.4531, 3.2500, -1.8984, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -3.6562, 0.7188, 3.6406, -1.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:08:41,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.28 | optimizer_step: 0.30 [2025-11-06 19:08:41,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.24 | 
bwd_microstep: 1288.24 | bwd_inner_microstep: 5.50 | bwd_allreduce_microstep: 1282.62 | step_microstep: 2.98 [2025-11-06 19:08:41,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 301.20 | bwd: 1295.35 | bwd_inner: 12.50 | bwd_allreduce: 1282.68 | step: 3.06 96%|█████████▌| 3374/3507 [1:23:55<03:10, 1.43s/it] {'loss': 0.1726, 'learning_rate': 7.53725914420378e-08, 'epoch': 0.96} 96%|█████████▌| 3374/3507 [1:23:55<03:10, 1.43s/it]tensor([[-2.8906, -3.6406, -2.8594, 0.6406, -0.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3750, -4.4375, -1.1797, 2.7031, -1.7109]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7188, -1.9844, 2.6406, 1.7578, -3.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:08:41,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.67 | bwd_microstep: 0.68 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.8125, -5.8750, -0.6680, 2.1250, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1250, -1.2188, 1.9141, 0.0125, -3.4688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.4062, -1.3672, 3.2344, -0.5625, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3750, 0.8789, 4.7812, -2.3906, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.2500, -4.1875, -0.6758, 3.1094, -1.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:08:41,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 19:08:41,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.97 | bwd_microstep: 74.34 | 
bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 73.41 | step_microstep: 1.69 [2025-11-06 19:08:41,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.65 | bwd: 75.02 | bwd_inner: 1.43 | bwd_allreduce: 73.45 | step: 1.77 96%|█████████▌| 3375/3507 [1:23:55<02:30, 1.14s/it] {'loss': 0.7147, 'learning_rate': 7.424482933587774e-08, 'epoch': 0.96} 96%|█████████▌| 3375/3507 [1:23:55<02:30, 1.14s/it]tensor([[-4.2188, -4.1875, -0.7812, 3.0000, -1.5078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:08:41,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.89 | bwd_microstep: 0.73 | bwd_inner_microstep: 0.64 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.3281, -4.0000, -0.9453, 3.9531, -0.4336]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2500, -1.6641, 2.7031, -0.2344, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7188, -2.6250, 2.1719, -0.0967, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0000, -0.0525, 2.7969, -1.3984, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -4.7188, -1.6016, 2.4062, -1.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [h264 @ 0xc231340] mmco: unref short failure [h264 @ 0xc231340] mmco: unref short failure tensor([[-2.3750, 1.0938, 1.9531, -2.1562, -2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5938, -4.0938, 0.2305, -0.4355, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:08:44,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.36 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 19:08:44,586] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 165.72 | bwd_microstep: 2455.13 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 2454.01 | step_microstep: 1.92 [2025-11-06 19:08:44,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 305.55 | bwd: 2455.86 | bwd_inner: 1.69 | bwd_allreduce: 2454.04 | step: 1.99 96%|█████████▋| 3376/3507 [1:23:58<03:34, 1.64s/it] {'loss': 0.2259, 'learning_rate': 7.312553657832567e-08, 'epoch': 0.96} 96%|█████████▋| 3376/3507 [1:23:58<03:34, 1.64s/it]tensor([[-2.9219, -4.0625, -3.0938, 1.5000, -0.2949]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.1250, -3.9688, 0.7930, 0.9180, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:08:44,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.31 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0938, -2.0156, 2.1094, 2.1094, -2.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2500, -4.4375, -0.5508, 2.0000, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.3750, -3.2344, 2.0000, 0.1045, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8438, 1.7344, 3.3750, -2.9062, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.5625, -3.4844, -0.1973, -2.3906, -5.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.7969, 0.4688, 2.7344, -2.4219, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:08:45,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.22 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 19:08:45,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 178.54 | bwd_microstep: 114.22 | bwd_inner_microstep: 1.48 | bwd_allreduce_microstep: 112.63 | step_microstep: 1.54 [2025-11-06 19:08:45,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.87 | bwd: 115.14 | bwd_inner: 2.32 | bwd_allreduce: 112.67 | step: 1.62 96%|█████████▋| 3377/3507 [1:23:58<02:49, 1.30s/it] {'loss': 0.4781, 'learning_rate': 7.201471412443983e-08, 'epoch': 0.96} 96%|█████████▋| 3377/3507 [1:23:58<02:49, 1.30s/it]tensor([[-2.3750, -3.3594, -2.3281, 1.9453, 0.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.3750, -4.4375, 0.3184, 1.2188, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:08:45,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.42 | bwd_microstep: 1.03 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-0.0510, 2.0938, 4.1250, 3.3438, 0.1328]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.7500, 1.0781, 2.6719, -1.7266, -3.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.6875, -0.5508, 3.5312, -0.8672, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0000, -0.8555, 3.2656, -1.3750, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5312, -1.7188, 0.8711, 0.4766, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -3.2188, 1.6562, 3.3125, -2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:08:46,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.21 | optimizer_step: 0.22 [2025-11-06 19:08:46,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.93 | 
bwd_microstep: 650.05 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 648.92 | step_microstep: 2.28 [2025-11-06 19:08:46,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.37 | bwd: 651.08 | bwd_inner: 1.98 | bwd_allreduce: 648.97 | step: 2.37 96%|█████████▋| 3378/3507 [1:23:59<02:38, 1.22s/it] {'loss': 0.6066, 'learning_rate': 7.091236292205317e-08, 'epoch': 0.96} 96%|█████████▋| 3378/3507 [1:23:59<02:38, 1.22s/it]tensor([[-3.8906, -0.1143, 3.4688, -0.2383, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9688, -1.4219, 1.8516, -1.4688, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, -2.8125, 0.9102, 1.3750, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9062, -3.3750, 1.2031, 2.6875, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:08:46,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.43 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.0938, -4.5938, -2.3906, 1.7891, -1.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1250, -1.4531, 1.8594, -1.6328, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.0000, -5.0938, -0.3574, 2.6875, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5000, -3.4531, 2.4375, 1.0938, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:08:46,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 19:08:46,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.41 | bwd_microstep: 40.50 | 
bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 39.45 | step_microstep: 1.80 [2025-11-06 19:08:46,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.85 | bwd: 41.40 | bwd_inner: 1.73 | bwd_allreduce: 39.50 | step: 1.89 96%|█████████▋| 3379/3507 [1:24:00<02:06, 1.01it/s] {'loss': 0.3425, 'learning_rate': 6.981848391176771e-08, 'epoch': 0.96} 96%|█████████▋| 3379/3507 [1:24:00<02:06, 1.01it/s]tensor([[-2.5000, -0.1118, 1.5469, -0.1943, -2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.4531, -3.0000, -1.7734, 1.8438, -0.2227]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:08:46,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 140.96 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.7188, -4.8125, -1.2344, 3.0625, -1.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9375, 0.7852, 2.8281, -1.5781, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.4688, -3.6875, -0.4648, 1.7188, -2.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.2188, -4.7500, 0.1729, 1.9688, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8438, -5.0000, -1.1172, 3.3125, -1.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0312, -0.7344, 2.0000, -1.0625, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:08:49,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.21 | optimizer_step: 0.33 [2025-11-06 19:08:49,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.73 | bwd_microstep: 2074.52 | bwd_inner_microstep: 1.04 | 
bwd_allreduce_microstep: 2073.38 | step_microstep: 2.56 [2025-11-06 19:08:49,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 400.72 | bwd: 2075.39 | bwd_inner: 1.80 | bwd_allreduce: 2073.43 | step: 2.65 96%|█████████▋| 3380/3507 [1:24:03<03:17, 1.55s/it] {'loss': 0.1342, 'learning_rate': 6.873307802695795e-08, 'epoch': 0.96} 96%|█████████▋| 3380/3507 [1:24:03<03:17, 1.55s/it]tensor([[-4.2500, -0.8945, 2.8906, 0.1094, -3.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3125, -3.5938, 0.5391, 1.4922, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.5469, 0.2070, 1.7969, 0.0630, -2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 19:08:49,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.87 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.0312, -3.7500, 1.4688, -0.3965, -5.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -0.7422, 3.2031, -1.3516, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -0.1367, 2.5312, -2.2969, -4.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.0625, -4.7812, -0.6758, 3.1094, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.0000, -3.7656, 1.6875, 1.9453, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:08:49,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.56 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 19:08:49,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.78 | bwd_microstep: 72.06 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 
71.25 | step_microstep: 2.16 [2025-11-06 19:08:49,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 282.65 | bwd: 72.91 | bwd_inner: 1.48 | bwd_allreduce: 71.29 | step: 2.24 96%|█████████▋| 3381/3507 [1:24:03<02:35, 1.23s/it] {'loss': 0.7865, 'learning_rate': 6.765614619376859e-08, 'epoch': 0.96} 96%|█████████▋| 3381/3507 [1:24:03<02:35, 1.23s/it]tensor([[-4.1562, -3.8125, -0.6289, 2.3125, -1.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3125, -2.7969, 1.4141, 0.9531, -3.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:08:50,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.73 | bwd_microstep: 0.63 | bwd_inner_microstep: 0.53 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.1875, -5.4688, 0.0698, 3.9219, -3.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7500, -3.2969, 2.2500, 2.0625, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3750, -3.7969, 0.7500, 1.8828, -3.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8906, 1.1797, 2.8906, -2.0469, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.3125, 0.2578, 3.8125, -1.8672, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-9.7500, -7.6250, -2.1562, -1.2656, -6.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:08:52,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.76 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 19:08:52,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.62 | bwd_microstep: 1272.91 | bwd_inner_microstep: 5.01 | bwd_allreduce_microstep: 1267.78 | step_microstep: 2.35 
[2025-11-06 19:08:52,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.37 | bwd: 1273.54 | bwd_inner: 5.56 | bwd_allreduce: 1267.83 | step: 2.43 96%|█████████▋| 3382/3507 [1:24:06<03:12, 1.54s/it] {'loss': 0.3902, 'learning_rate': 6.658768933111238e-08, 'epoch': 0.96} 96%|█████████▋| 3382/3507 [1:24:06<03:12, 1.54s/it]tensor([[-6.9375, -2.3906, 2.6562, -2.2812, -6.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.7500, -5.8125, -0.8750, -0.0513, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:08:52,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.58 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-6.0000, -5.2188, -0.1475, 3.4062, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -3.3750, 2.0469, 2.6094, -3.6094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.4844, 1.3438, 3.2812, -1.5781, -3.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7500, -3.4219, 2.4219, 2.5938, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.9375, -0.9141, 2.1719, -2.0625, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.2812, -6.5625, -2.4062, 2.4531, -2.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:08:54,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 19:08:54,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.64 | bwd_microstep: 1817.63 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 1816.49 | step_microstep: 2.07 [2025-11-06 
19:08:54,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 331.24 | bwd: 1818.56 | bwd_inner: 1.84 | bwd_allreduce: 1816.55 | step: 2.17 96%|█████████▋| 3383/3507 [1:24:08<03:35, 1.73s/it] {'loss': 0.6891, 'learning_rate': 6.552770835067224e-08, 'epoch': 0.96} 96%|█████████▋| 3383/3507 [1:24:08<03:35, 1.73s/it]tensor([[-4.4688, -4.3438, -0.5078, 3.3906, -1.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:08:54,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.57 | bwd_microstep: 6.19 | bwd_inner_microstep: 6.06 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.6562, -4.4688, -0.3516, 3.6406, -1.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5938, -3.2188, 1.5938, 3.4062, -2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.4688, -1.3750, 1.8594, -0.7695, -4.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.9375, -6.7500, -2.5156, -0.4902, -5.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6094, -0.4609, 2.6094, 0.2598, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.2500, -7.5000, -2.5781, 0.7812, -4.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -4.1562, -0.0356, 2.6406, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:08:55,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 19:08:55,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.24 | bwd_microstep: 992.59 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 991.69 | step_microstep: 1.92 [2025-11-06 19:08:55,704] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 275.84 | bwd: 998.78 | bwd_inner: 6.87 | bwd_allreduce: 991.75 | step: 2.00 96%|█████████▋| 3384/3507 [1:24:09<03:17, 1.61s/it] {'loss': 0.3139, 'learning_rate': 6.447620415689693e-08, 'epoch': 0.96} 96%|█████████▋| 3384/3507 [1:24:09<03:17, 1.61s/it]tensor([[-3.0000, -1.4375, 1.6250, 2.0156, -1.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -2.9844, 2.0781, 2.7812, -3.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3125, -0.7305, 2.2344, -1.1328, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:08:55,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 198.62 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.5625, -2.2188, 3.0625, 0.8320, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.6875, -0.6250, 2.2656, -0.0325, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -3.5469, 0.2695, 1.3906, -3.1094]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-0.5391, 2.3750, 3.3438, 0.2334, -1.1172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.7188, -3.5938, 0.3359, 0.2432, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:08:56,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.21 | optimizer_step: 0.29 [2025-11-06 19:08:56,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.58 | bwd_microstep: 168.99 | bwd_inner_microstep: 2.79 | bwd_allreduce_microstep: 166.10 | step_microstep: 2.26 [2025-11-06 19:08:56,611] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 381.20 | bwd: 169.73 | bwd_inner: 3.43 | bwd_allreduce: 166.15 | step: 2.35 97%|█████████▋| 3385/3507 [1:24:10<02:50, 1.40s/it] {'loss': 0.6291, 'learning_rate': 6.34331776470054e-08, 'epoch': 0.97} 97%|█████████▋| 3385/3507 [1:24:10<02:50, 1.40s/it]tensor([[-5.0000, -5.2500, -1.9609, 2.3281, -2.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:08:56,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.98 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.6250, -4.4062, -0.2246, 3.5781, -1.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7031, -4.1562, -1.9922, 2.1875, -1.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8125, -0.6797, 2.2812, 1.6094, -2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.2344, -2.2188, 0.8438, 4.3125, -0.0649]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.1328, 2.1562, 4.0938, 0.6602, -1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.6250, -3.3906, 1.8438, -0.2178, -5.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5938, -4.7812, -0.5664, 1.8516, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:09:00,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.86 | optimizer_gradients: 0.22 | optimizer_step: 0.23 [2025-11-06 19:09:00,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.07 | bwd_microstep: 1960.43 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 1959.44 | step_microstep: 3.07 [2025-11-06 19:09:00,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.08 
| bwd: 1961.27 | bwd_inner: 1.61 | bwd_allreduce: 1959.50 | step: 3.17 97%|█████████▋| 3386/3507 [1:24:14<04:23, 2.18s/it] {'loss': 0.1256, 'learning_rate': 6.239862971097909e-08, 'epoch': 0.97} 97%|█████████▋| 3386/3507 [1:24:14<04:23, 2.18s/it]tensor([[-4.5312, -3.5000, 0.3242, 2.4531, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8750, -2.6719, 0.2021, 1.8906, -2.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8125, -5.1250, -1.0000, 2.0625, -3.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -1.3750, 2.7188, -0.0127, -4.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:00,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.66 | bwd_microstep: 1.00 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.3125, 0.3887, 3.1406, -1.3516, -3.7031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3125, -3.2500, 1.1641, 1.8594, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.6562, -1.2891, 3.4375, -1.3828, -5.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4375, -0.9805, 4.0000, -0.9961, -5.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:01,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.46 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 19:09:01,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 334.88 | bwd_microstep: 2.28 | bwd_inner_microstep: 1.41 | bwd_allreduce_microstep: 0.80 | step_microstep: 1.86 [2025-11-06 19:09:01,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 612.57 | bwd: 3.28 | bwd_inner: 2.32 
| bwd_allreduce: 0.84 | step: 1.95 97%|█████████▋| 3387/3507 [1:24:15<03:26, 1.72s/it] {'loss': 0.1212, 'learning_rate': 6.137256123156631e-08, 'epoch': 0.97} 97%|█████████▋| 3387/3507 [1:24:15<03:26, 1.72s/it]tensor([[-5.1250, -5.7188, -3.2812, 1.2578, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-6.0000, -4.3438, 0.2559, 1.5703, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:01,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.52 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-5.3438, -0.7344, 3.6406, -1.7500, -5.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.3125, 1.2500, 3.7969, -2.4375, -4.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8125, -2.8438, 0.9922, -1.1719, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7188, -5.1250, -0.7227, 2.7500, -2.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.3438, -3.5000, 2.0000, 0.7383, -4.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.9062, -5.3438, 1.3203, 1.7109, -5.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:09:01,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.08 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:09:01,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.94 | bwd_microstep: 180.60 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 179.60 | step_microstep: 3.23 [2025-11-06 19:09:01,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 327.49 | bwd: 181.59 | bwd_inner: 1.81 | bwd_allreduce: 179.64 | 
step: 3.32 97%|█████████▋| 3388/3507 [1:24:15<02:42, 1.37s/it] {'loss': 0.6287, 'learning_rate': 6.035497308428229e-08, 'epoch': 0.97} 97%|█████████▋| 3388/3507 [1:24:15<02:42, 1.37s/it]tensor([[-7.5312, -4.1250, 1.3203, -0.7148, -6.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8750, 0.2559, 3.3750, -1.6250, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:01,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.19 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-5.0938, -1.0000, 3.2969, -1.3672, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5938, -4.9688, -2.0312, 2.2969, -1.7109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-0.7930, 2.8594, 2.8750, -2.0000, -1.9844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-8.1875, -4.3125, 1.9609, -1.0625, -6.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6250, -1.5234, 2.2812, 0.5078, -3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.4062, -1.7734, 2.3438, 3.2188, -1.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:09:02,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.15 | optimizer_step: 0.15 [2025-11-06 19:09:02,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 103.25 | bwd_microstep: 656.62 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 655.50 | step_microstep: 1.53 [2025-11-06 19:09:02,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 244.45 | bwd: 657.61 | bwd_inner: 1.89 | bwd_allreduce: 655.55 | step: 1.64 97%|█████████▋| 
3389/3507 [1:24:16<02:26, 1.24s/it] {'loss': 0.7758, 'learning_rate': 5.934586613740245e-08, 'epoch': 0.97} 97%|█████████▋| 3389/3507 [1:24:16<02:26, 1.24s/it]tensor([[-2.9219, 0.9883, 2.8281, -1.3359, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.8125, 1.7422, 4.1562, 0.4043, -2.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.6250, -5.9062, 0.1787, 1.8516, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.9531, -2.3750, 1.1953, 3.9688, -0.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.1719, -0.2373, 3.5938, 3.8750, -1.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:03,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.25 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.3125, -0.1143, 3.6406, -0.9023, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.3125, -2.3125, 1.4141, 1.6094, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7188, -4.4062, 0.6602, 2.9375, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:09:05,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.25 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 19:09:05,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 294.83 | bwd_microstep: 1891.29 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 1890.05 | step_microstep: 1.79 [2025-11-06 19:09:05,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 492.11 | bwd: 1892.12 | bwd_inner: 1.90 | bwd_allreduce: 1890.08 | step: 1.87 97%|█████████▋| 3390/3507 [1:24:19<03:17, 
1.69s/it] {'loss': 0.6998, 'learning_rate': 5.8345241251969165e-08, 'epoch': 0.97} 97%|█████████▋| 3390/3507 [1:24:19<03:17, 1.69s/it]tensor([[-6.1562, -5.6875, -1.5156, 1.8359, -3.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0312, -4.5312, -1.5469, 3.0781, -1.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:05,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.78 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.0938, -6.2812, -1.6094, 1.3594, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5156, 1.0547, 3.8125, -1.9297, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.6875, -6.2188, -0.1001, 2.1875, -4.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-2.8438, -1.0078, 2.6562, 2.7812, -1.7422]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2500, -4.0000, 1.5156, 1.9375, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-8.1875, -6.0625, 0.3945, 1.6172, -5.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:09:06,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.17 | optimizer_step: 0.20 [2025-11-06 19:09:06,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 269.12 | bwd_microstep: 44.01 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 43.06 | step_microstep: 1.96 [2025-11-06 19:09:06,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 473.93 | bwd: 44.91 | bwd_inner: 1.67 | bwd_allreduce: 43.09 | step: 2.04 97%|█████████▋| 3391/3507 [1:24:19<02:36, 1.35s/it] {'loss': 0.5153, 
'learning_rate': 5.7353099281785004e-08, 'epoch': 0.97} 97%|█████████▋| 3391/3507 [1:24:19<02:36, 1.35s/it]tensor([[-3.5625, 0.1338, 2.3281, -1.6406, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-1.3906, 1.5156, 1.5859, -1.1797, -1.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.4375, -4.3125, -0.8203, 2.7344, -1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:06,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.63 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-6.5938, -6.3125, -1.6250, 2.7031, -3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2188, -3.8281, -1.5625, 2.7031, -0.6133]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.5000, -4.0938, -1.6641, 2.7031, -0.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.9688, -1.5625, 2.6719, 0.0176, -4.3438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7031, -4.2188, -2.2656, 1.5547, -1.1797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') [2025-11-06 19:09:07,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.19 | optimizer_step: 0.27 [2025-11-06 19:09:07,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.81 | bwd_microstep: 1373.68 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 1372.52 | step_microstep: 2.38 [2025-11-06 19:09:07,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 387.46 | bwd: 1374.61 | bwd_inner: 1.91 | bwd_allreduce: 1372.57 | step: 2.46 97%|█████████▋| 3392/3507 [1:24:21<02:50, 1.49s/it] {'loss': 0.4989, 'learning_rate': 
5.636944107341391e-08, 'epoch': 0.97} 97%|█████████▋| 3392/3507 [1:24:21<02:50, 1.49s/it]tensor([[-4.1250, -3.0000, 0.8203, 2.4844, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.1875, -2.3750, 1.6953, 0.2373, -4.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:08,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.26 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-1.4688, 2.2969, 3.3594, -1.4766, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.9688, 0.0294, 2.2969, -2.3438, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5625, -2.3281, 1.2969, 2.9219, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[4.2812, 5.5625, 7.2812, 7.1875, 3.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.0625, -6.5625, -0.5391, 1.7188, -4.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.9531, -2.8281, -2.2188, 1.3906, 0.2090]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 19:09:09,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.72 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 19:09:09,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.05 | bwd_microstep: 1.76 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.54 [2025-11-06 19:09:09,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.31 | bwd: 2.73 | bwd_inner: 1.75 | bwd_allreduce: 0.84 | step: 2.64 97%|█████████▋| 3393/3507 [1:24:23<02:50, 1.49s/it] {'loss': 0.6204, 'learning_rate': 5.539426746618337e-08, 'epoch': 0.97} 
97%|█████████▋| 3393/3507 [1:24:23<02:50, 1.49s/it]tensor([[-0.4395, 3.0156, 2.0781, -2.3281, -1.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:3') tensor([[-4.8125, -1.5234, 2.7812, 0.5977, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.9688, -4.6875, -0.6445, 3.2031, -2.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:09,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 206.91 | bwd_microstep: 1.19 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.4375, -4.2812, 1.7422, 2.4375, -4.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.3984, 1.7734, 1.8125, -1.8828, -2.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:2') tensor([[-5.3125, -2.2500, 1.7656, -0.3066, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4375, -1.5078, 3.1250, -0.5664, -5.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7812, -2.4219, 2.8750, 0.7500, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:09:10,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 19:09:10,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.51 | bwd_microstep: 1018.61 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 1017.43 | step_microstep: 2.44 [2025-11-06 19:09:10,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 479.46 | bwd: 1019.80 | bwd_inner: 2.19 | bwd_allreduce: 1017.47 | step: 2.53 97%|█████████▋| 3394/3507 [1:24:24<02:50, 1.51s/it] {'loss': 1.072, 'learning_rate': 5.442757929217779e-08, 'epoch': 0.97} 97%|█████████▋| 3394/3507 
[1:24:24<02:50, 1.51s/it]tensor([[-4.6250, -0.6172, 3.8125, -0.0096, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0938, -4.6562, 1.3906, 1.3516, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:11,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 154.08 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.8281, -4.3750, -2.1406, 2.0781, -1.1719]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.3750, -3.9531, 0.4805, 0.0752, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.8125, 0.2285, 3.0625, -1.2344, -4.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.1875, -3.4844, -1.1172, 2.8594, -0.6758]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.1250, -5.9062, -0.7695, 1.8125, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1250, -3.4062, 0.8828, 1.4531, -3.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:09:11,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 19:09:11,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.54 | bwd_microstep: 78.09 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 77.01 | step_microstep: 1.87 [2025-11-06 19:09:11,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.65 | bwd: 78.95 | bwd_inner: 1.75 | bwd_allreduce: 77.06 | step: 1.95 97%|█████████▋| 3395/3507 [1:24:25<02:13, 1.19s/it] {'loss': 0.2936, 'learning_rate': 5.346937737624624e-08, 'epoch': 0.97} 97%|█████████▋| 3395/3507 [1:24:25<02:13, 
1.19s/it]tensor([[-2.1562, 2.2031, 4.0938, -1.7656, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:11,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.84 | bwd_microstep: 0.69 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.9062, -5.0938, -0.5859, 2.3438, -3.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5938, -6.4062, -1.9766, 2.3594, -3.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.5781, 2.2500, 3.3750, -1.8594, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.4375, -7.1562, -2.7344, 1.3359, -4.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.2188, -4.6562, 0.4473, 2.1562, -3.7969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8906, -0.8086, -0.5273, -3.6250, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.4688, -4.3750, -0.5625, 1.4609, -3.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:09:12,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 19:09:12,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.41 | bwd_microstep: 719.04 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 718.16 | step_microstep: 1.68 [2025-11-06 19:09:12,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 266.26 | bwd: 719.73 | bwd_inner: 1.39 | bwd_allreduce: 718.20 | step: 1.77 97%|█████████▋| 3396/3507 [1:24:26<02:06, 1.14s/it] {'loss': 0.2182, 'learning_rate': 5.251966253599028e-08, 'epoch': 0.97} 97%|█████████▋| 3396/3507 [1:24:26<02:06, 1.14s/it]tensor([[-1.4766, 
1.5234, 3.6875, 0.7148, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9062, -2.7188, 2.5625, 0.8438, -4.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:12,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.70 | bwd_microstep: 0.81 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.1875, -2.4531, 1.2188, 1.9609, -2.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.1094, 1.2812, 2.9219, -2.6094, -4.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9688, -3.2500, 2.2188, 1.3125, -4.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8125, -4.7188, -0.1152, 2.4844, -3.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7188, -5.3438, -0.8945, 2.8438, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-9.0000, -6.5000, -0.7539, -1.0234, -6.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:13,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 19:09:13,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.83 | bwd_microstep: 2.35 | bwd_inner_microstep: 1.43 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.07 [2025-11-06 19:09:13,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 315.55 | bwd: 3.16 | bwd_inner: 2.15 | bwd_allreduce: 0.87 | step: 2.14 97%|█████████▋| 3397/3507 [1:24:27<02:16, 1.24s/it] {'loss': 0.3352, 'learning_rate': 5.1578435581775e-08, 'epoch': 0.97} 97%|█████████▋| 3397/3507 [1:24:27<02:16, 1.24s/it]tensor([[-5.9375, -4.7188, 0.1089, 2.5625, -3.2969]], 
device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([[-3.5156, -4.1250, -1.7500, 2.6250, -0.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=)tensor([3], device='cuda:1') tensor([3], device='cuda:2') tensor([[-3.6719, -0.2676, 2.6875, -0.6484, -3.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-9.4375, -8.8125, -3.7344, 0.0942, -5.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7656, -3.7812, -2.0000, 3.0156, 0.0229]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.9688, -5.0938, 0.8789, 2.0312, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:14,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 319.90 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.7812, -4.4062, -0.5078, 2.8750, -2.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-8.0625, -5.4688, -0.4023, -0.8789, -6.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:14,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 19:09:14,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.41 | bwd_microstep: 2.01 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.38 [2025-11-06 19:09:14,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 487.33 | bwd: 2.83 | bwd_inner: 1.84 | bwd_allreduce: 0.86 | step: 2.45 97%|█████████▋| 3398/3507 [1:24:28<01:51, 1.03s/it] {'loss': 0.1389, 'learning_rate': 5.06456973167202e-08, 'epoch': 0.97} 97%|█████████▋| 3398/3507 [1:24:28<01:51, 1.03s/it]tensor([[-4.6250, -1.5078, 2.9219, 1.0391, -3.7969]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.3125, -1.1406, 1.8516, -0.7891, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2188e+00, 5.3787e-04, 2.8281e+00, -1.9453e+00, -4.5625e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6875, -3.8750, 0.4746, 1.5000, -3.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:15,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.93 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-4.5938, -3.0312, 0.8906, 1.5703, -2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.6875, 1.4609, 2.2969, -0.6445, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([0], device='cuda:2') tensor([[-5.5938, -2.5156, 2.6406, 1.1562, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0312, -2.8281, 1.3281, 1.2500, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:09:16,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.21 | optimizer_step: 0.20 [2025-11-06 19:09:16,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.66 | bwd_microstep: 472.04 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 470.96 | step_microstep: 1.93 [2025-11-06 19:09:16,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 367.57 | bwd: 473.00 | bwd_inner: 1.81 | bwd_allreduce: 471.01 | step: 2.04 97%|█████████▋| 3399/3507 [1:24:30<02:21, 1.31s/it] {'loss': 0.9909, 'learning_rate': 4.972144853670369e-08, 'epoch': 0.97} 97%|█████████▋| 3399/3507 [1:24:30<02:21, 1.31s/it]tensor([[-5.7500, -3.0312, 2.6719, 1.6406, -4.3438]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:16,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.19 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.61 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.9062, -2.8281, 1.1953, 0.5977, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5312, -5.5938, -1.8281, 2.4375, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7812, -3.9219, 0.0452, 2.6250, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0000, -1.0234, 2.1250, 0.2041, -3.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.6406, -2.7812, 0.0425, 2.4375, -1.6016]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.2500, -5.1250, -0.3730, 1.7344, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.8125, 2.5625, 3.3906, -2.4531, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') [2025-11-06 19:09:17,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.16 | optimizer_step: 0.16 [2025-11-06 19:09:17,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.38 | bwd_microstep: 844.05 | bwd_inner_microstep: 3.50 | bwd_allreduce_microstep: 840.45 | step_microstep: 1.69 [2025-11-06 19:09:17,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 283.60 | bwd: 844.77 | bwd_inner: 4.13 | bwd_allreduce: 840.49 | step: 1.77 97%|█████████▋| 3400/3507 [1:24:31<02:15, 1.27s/it] {'loss': 0.3091, 'learning_rate': 4.8805690030360176e-08, 'epoch': 0.97} 97%|█████████▋| 3400/3507 [1:24:31<02:15, 1.27s/it]tensor([[-3.3125, -4.0000, -2.3438, 1.5469, -0.8125]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:17,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.42 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.8125, -3.9688, 0.6094, 1.1484, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1875, -1.4062, 3.1250, -2.3281, -6.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.2656, 0.9570, 3.0938, -2.2812, -4.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.0938, -4.5000, 1.3828, 1.3828, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0625, 0.4375, 1.2344, -2.7031, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-2.9688, 0.2061, 1.9688, -1.2344, -3.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-8.1875, -6.6562, -0.4023, 1.7969, -5.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:09:19,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.26 | optimizer_gradients: 0.25 | optimizer_step: 0.43 [2025-11-06 19:09:19,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.79 | bwd_microstep: 631.03 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 629.97 | step_microstep: 2.88 [2025-11-06 19:09:19,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 275.22 | bwd: 631.92 | bwd_inner: 1.74 | bwd_allreduce: 630.02 | step: 2.96 97%|█████████▋| 3401/3507 [1:24:33<02:47, 1.58s/it] {'loss': 0.4326, 'learning_rate': 4.789842257907795e-08, 'epoch': 0.97} 97%|█████████▋| 3401/3507 [1:24:33<02:47, 1.58s/it]tensor([[-1.4609, 2.6875, 3.5625, -2.2188, -2.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:2') tensor([[-2.4375, -1.5234, 2.6562, 5.3750, -0.4883]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.1250, -4.4688, 0.4434, 1.9219, -3.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9375, -4.6250, -0.7812, 2.7188, -2.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:20,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.70 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.8125, -1.6172, 2.7500, 0.5586, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.0469, 1.4453, 3.5938, -0.4043, -2.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3438, -2.9062, 1.1562, 0.1621, -4.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -3.0156, 1.4297, 1.8516, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:20,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.67 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 19:09:20,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.45 | bwd_microstep: 1.52 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.68 | step_microstep: 2.15 [2025-11-06 19:09:20,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 389.17 | bwd: 2.25 | bwd_inner: 1.38 | bwd_allreduce: 0.72 | step: 2.24 97%|█████████▋| 3402/3507 [1:24:34<02:10, 1.24s/it] {'loss': 0.458, 'learning_rate': 4.699964695699999e-08, 'epoch': 0.97} 97%|█████████▋| 3402/3507 [1:24:34<02:10, 1.24s/it]tensor([[-1.1719, 2.0156, 1.5469, -2.6250, -2.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.7031, 
0.4434, 4.4062, -0.2773, -4.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.2188, -1.1641, 2.3438, -0.0199, -3.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.8750, -5.0625, 1.0547, 2.6562, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:20,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 281.51 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.2812, -0.2393, 3.7656, -0.6953, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8438, -4.9688, -0.2695, 2.4375, -3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.3750, -2.7500, 2.6250, -0.0141, -5.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.6250, -4.0938, 1.4531, 1.3125, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:09:23,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.74 | optimizer_gradients: 0.18 | optimizer_step: 0.22 [2025-11-06 19:09:23,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.84 | bwd_microstep: 1895.33 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 1894.08 | step_microstep: 2.76 [2025-11-06 19:09:23,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 421.38 | bwd: 1896.33 | bwd_inner: 2.05 | bwd_allreduce: 1894.13 | step: 2.85 97%|█████████▋| 3403/3507 [1:24:37<03:16, 1.89s/it] {'loss': 0.2077, 'learning_rate': 4.610936393102616e-08, 'epoch': 0.97} 97%|█████████▋| 3403/3507 [1:24:37<03:16, 1.89s/it]tensor([[-2.4531, -3.4531, -2.0312, 2.5156, 0.0811]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.9375, -4.2812, 0.6797, 
-2.2969, -6.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6562, -2.7188, 1.3906, -0.0811, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:23,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.15 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-5.5625, -4.2500, 1.0938, 3.5000, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.9375, -2.4375, 2.2344, -0.3027, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-8.4375, -6.5000, -0.4277, 0.8281, -5.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0000, -0.9062, 3.7969, -0.2354, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.2188, -3.8125, 1.3359, 3.2656, -2.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:09:24,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.20 | optimizer_step: 0.19 [2025-11-06 19:09:24,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.57 | bwd_microstep: 56.73 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 55.78 | step_microstep: 1.91 [2025-11-06 19:09:24,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.74 | bwd: 57.77 | bwd_inner: 1.48 | bwd_allreduce: 56.12 | step: 2.00 97%|█████████▋| 3404/3507 [1:24:37<02:30, 1.46s/it] {'loss': 0.4628, 'learning_rate': 4.522757426080771e-08, 'epoch': 0.97} 97%|█████████▋| 3404/3507 [1:24:37<02:30, 1.46s/it]tensor([[-6.6875, -4.0312, 1.5859, 1.1719, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7031, -0.4336, 1.6250, -1.3906, -3.5781]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4375, -5.1250, -1.1016, 2.6094, -2.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:24,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.89 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.09 tensor([[-0.8633, 1.2578, 3.1250, 1.9297, -0.7227]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-6.8125, -3.6250, 1.3438, -0.6523, -5.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8750, -2.2500, 2.2969, 1.1953, -3.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1562, -1.8516, 2.6250, 0.3125, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5625, -4.8125, -1.4141, 3.1406, -1.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:09:25,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 19:09:25,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.69 | bwd_microstep: 103.10 | bwd_inner_microstep: 1.45 | bwd_allreduce_microstep: 101.50 | step_microstep: 1.96 [2025-11-06 19:09:25,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.61 | bwd: 103.93 | bwd_inner: 2.17 | bwd_allreduce: 101.54 | step: 2.06 97%|█████████▋| 3405/3507 [1:24:38<02:12, 1.30s/it] {'loss': 0.6317, 'learning_rate': 4.435427869874942e-08, 'epoch': 0.97} 97%|█████████▋| 3405/3507 [1:24:38<02:12, 1.30s/it]tensor([[-3.9375, -3.2969, 0.7969, 3.7656, -1.6172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:25,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 155.76 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.12 tensor([[-5.4688, -3.7969, 0.5586, 1.7656, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7656, 0.6914, 2.8594, -0.6133, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6250, -3.9375, 1.1797, 2.7500, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -1.3047, 3.7656, -0.2363, -5.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0938, -3.1562, 1.1641, -0.4473, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.8125, -2.3125, 1.7578, 1.0547, -3.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9062, -0.3594, 2.8750, -3.1562, -5.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:28,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.52 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 19:09:28,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.74 | bwd_microstep: 1.84 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.82 | step_microstep: 4.20 [2025-11-06 19:09:28,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.52 | bwd: 2.74 | bwd_inner: 1.72 | bwd_allreduce: 0.87 | step: 4.32 97%|█████████▋| 3406/3507 [1:24:42<03:16, 1.95s/it] {'loss': 0.6267, 'learning_rate': 4.3489477990007466e-08, 'epoch': 0.97} 97%|█████████▋| 3406/3507 [1:24:42<03:16, 1.95s/it]tensor([[-5.2500, -5.1250, -1.5312, 2.2812, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7188, -4.0312, -0.0075, 2.6562, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:2') tensor([[-0.9414, 2.0938, 3.9375, 1.6562, -1.1172]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.3125, -1.6328, 3.0469, 2.1094, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:28,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.78 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.0312, -1.2812, 1.9844, 0.0240, -3.3906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8750, -4.5625, 0.5547, 2.6875, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.4688, -3.8281, 2.0156, 1.6484, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5625, -3.7812, 0.5859, 1.2031, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:09:28,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.61 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 19:09:28,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 157.36 | bwd_microstep: 8.19 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 7.40 | step_microstep: 2.13 [2025-11-06 19:09:28,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.17 | bwd: 8.86 | bwd_inner: 1.27 | bwd_allreduce: 7.44 | step: 2.21 97%|█████████▋| 3407/3507 [1:24:42<02:28, 1.49s/it] {'loss': 0.7146, 'learning_rate': 4.263317287249158e-08, 'epoch': 0.97} 97%|█████████▋| 3407/3507 [1:24:42<02:28, 1.49s/it]tensor([[-1.1641, 2.7031, 3.2500, -2.0000, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.9688, -3.6250, 0.3848, 0.2383, -4.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 
19:09:29,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.96 | bwd_microstep: 1.13 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-4.2812, 0.4785, 3.8906, -2.5312, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1250, -4.7188, -2.0469, 2.5469, -1.2109]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-8.3125, -5.5312, 1.3594, 1.4453, -5.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.0000, -2.5312, -0.0674, 2.0469, -1.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-9.1875, -6.0625, 0.9102, 0.3047, -6.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.7188, -3.5625, -2.3438, 1.6250, -0.3301]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:31,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.58 | optimizer_gradients: 0.16 | optimizer_step: 0.18 [2025-11-06 19:09:31,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.55 | bwd_microstep: 2.14 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.98 | step_microstep: 2.54 [2025-11-06 19:09:31,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 348.85 | bwd: 3.27 | bwd_inner: 2.03 | bwd_allreduce: 1.02 | step: 2.66 97%|█████████▋| 3408/3507 [1:24:45<02:54, 1.76s/it] {'loss': 0.403, 'learning_rate': 4.1785364076859515e-08, 'epoch': 0.97} 97%|█████████▋| 3408/3507 [1:24:45<02:54, 1.76s/it]tensor([[-6.6250, -3.3438, 2.3281, 0.7266, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -0.9062, 3.2812, 0.7852, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.3750, -1.2109, 2.6250, 0.2656, 
-3.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.7188, -5.6250, -0.6562, 1.9844, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:31,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.50 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-4.2500, -3.3594, 0.2695, 2.3438, -2.2031]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2344, -4.0625, -2.5625, 1.6875, -0.6758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-1.8281, -0.6250, 1.9375, 2.6875, -0.8086]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.5938, -4.3125, -2.3750, 1.8203, -0.9883]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:09:32,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.30 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 19:09:32,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.20 | bwd_microstep: 369.01 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 367.92 | step_microstep: 1.78 [2025-11-06 19:09:32,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.72 | bwd: 369.80 | bwd_inner: 1.68 | bwd_allreduce: 367.97 | step: 1.87 97%|█████████▋| 3409/3507 [1:24:45<02:22, 1.46s/it] {'loss': 0.4473, 'learning_rate': 4.0946052326522603e-08, 'epoch': 0.97} 97%|█████████▋| 3409/3507 [1:24:45<02:22, 1.46s/it]tensor([[-3.2031, -3.8281, -2.4219, 1.6562, -0.6641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-6.7188, -3.6562, 2.3438, 1.1172, -5.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:32,277] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 170.02 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.71 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.5625, -4.3125, 1.9609, 0.2080, -5.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.1562, -1.0391, 2.3594, -0.0923, -3.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8750, -5.4688, -1.0703, 2.6406, -2.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5312, -5.2500, -1.1094, 2.5469, -2.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.3438, -3.7031, 0.0215, 3.0156, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.7812, -2.3125, 3.3594, -1.1562, -6.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:33,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 19:09:33,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.51 | bwd_microstep: 2.22 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 1.04 | step_microstep: 2.16 [2025-11-06 19:09:33,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 357.55 | bwd: 3.02 | bwd_inner: 1.80 | bwd_allreduce: 1.07 | step: 2.24 97%|█████████▋| 3410/3507 [1:24:47<02:25, 1.50s/it] {'loss': 0.5415, 'learning_rate': 4.011523833763909e-08, 'epoch': 0.97} 97%|█████████▋| 3410/3507 [1:24:47<02:25, 1.50s/it]tensor([[-2.2344, 1.0469, 1.8125, -1.6328, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-3.4531, -4.4375, -3.1562, 1.2969, -0.7148]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3750, -3.1406, 0.4238, -0.3574, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:0') [2025-11-06 19:09:33,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.92 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.4375, -4.4688, 0.2324, 2.9531, -2.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.6562, -5.7812, 0.3008, 1.7188, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0938, -3.8125, 0.3438, 1.9062, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0625, -1.2500, 1.3906, -0.7148, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.6562, -4.6562, -1.4766, 2.2969, -1.9297]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:34,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.18 | optimizer_step: 0.20 [2025-11-06 19:09:34,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.48 | bwd_microstep: 2.04 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.90 [2025-11-06 19:09:34,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 337.42 | bwd: 2.88 | bwd_inner: 1.82 | bwd_allreduce: 0.91 | step: 2.97 97%|█████████▋| 3411/3507 [1:24:47<01:51, 1.17s/it] {'loss': 0.6456, 'learning_rate': 3.929292281911856e-08, 'epoch': 0.97} 97%|█████████▋| 3411/3507 [1:24:47<01:51, 1.17s/it]tensor([[-6.2188, -5.9062, -2.2344, 1.4766, -3.2188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5938, -3.4219, 0.3594, 2.0312, -2.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:34,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.93 | bwd_microstep: 0.86 | 
bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.8438, -4.0938, 0.0476, 2.8438, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.5859, 2.6562, 3.9375, -1.5781, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5938, -2.7812, 2.9375, 0.0255, -5.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.3750, -2.7188, 1.8828, 0.9961, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7812, -2.0781, 3.4844, 0.4551, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5625, -1.3047, 2.4688, -0.2930, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:36,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.15 | optimizer_step: 0.21 [2025-11-06 19:09:36,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.91 | bwd_microstep: 2.09 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.97 | step_microstep: 2.42 [2025-11-06 19:09:36,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 317.86 | bwd: 2.95 | bwd_inner: 1.81 | bwd_allreduce: 1.00 | step: 2.49 97%|█████████▋| 3412/3507 [1:24:50<02:27, 1.56s/it] {'loss': 0.2464, 'learning_rate': 3.847910647261754e-08, 'epoch': 0.97} 97%|█████████▋| 3412/3507 [1:24:50<02:27, 1.56s/it]tensor([[-5.5312, -0.9102, 3.9531, -1.3516, -5.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4062, -5.6250, -1.8984, 2.5469, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -3.9219, 0.1289, 2.1094, -2.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:36,730] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.00 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.7812, -0.6211, 3.9375, -0.4199, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.5312, -4.0000, 1.1562, 1.0391, -4.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3750, -3.0156, 0.1377, 3.1406, -1.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.5625, -4.9062, -0.4961, 2.5156, -2.8906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6250, -4.0938, 0.2451, 1.3203, -3.6406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:09:36,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 19:09:36,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.48 | bwd_microstep: 70.02 | bwd_inner_microstep: 1.26 | bwd_allreduce_microstep: 68.67 | step_microstep: 1.54 [2025-11-06 19:09:36,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.51 | bwd: 71.03 | bwd_inner: 2.18 | bwd_allreduce: 68.71 | step: 1.62 97%|█████████▋| 3413/3507 [1:24:50<01:54, 1.21s/it] {'loss': 0.5539, 'learning_rate': 3.767378999254168e-08, 'epoch': 0.97} 97%|█████████▋| 3413/3507 [1:24:50<01:54, 1.21s/it]tensor([[-3.8125, -4.7188, -2.1250, 3.0312, -0.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.5625, -4.0312, -1.2891, 3.2031, -0.8359]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:37,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.09 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | 
step_microstep: 0.07 tensor([[-4.5625, -3.5469, 0.0874, 2.2031, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3125, -1.6953, 1.7969, -1.5391, -4.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5938, -2.2188, 1.1328, 0.2949, -3.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5312, -1.3672, 1.7109, -0.9258, -4.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1250, -3.9531, -0.3242, 3.3906, -1.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -4.6875, -0.7461, 2.5938, -2.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:37,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 19:09:37,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 151.62 | bwd_microstep: 2.03 | bwd_inner_microstep: 1.05 | bwd_allreduce_microstep: 0.89 | step_microstep: 1.72 [2025-11-06 19:09:37,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 332.72 | bwd: 2.79 | bwd_inner: 1.73 | bwd_allreduce: 0.92 | step: 1.79 97%|█████████▋| 3414/3507 [1:24:51<01:42, 1.10s/it] {'loss': 0.1042, 'learning_rate': 3.68769740660424e-08, 'epoch': 0.97} 97%|█████████▋| 3414/3507 [1:24:51<01:42, 1.10s/it]tensor([[-6.2812, -2.7031, 1.9922, -0.5742, -5.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:37,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.15 | bwd_microstep: 1.08 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.5312, -4.1562, -2.1562, 2.0781, -0.9570]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-5.6250, -5.8750, -1.8984, 2.9062, -2.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.0625, -4.4375, -0.0267, 3.0625, -2.4219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.8125, -3.1562, 0.6406, 1.6094, -3.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7812, -5.0312, -0.2598, 2.9844, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.1562, -3.6406, 0.4570, 3.7188, -1.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.4375, -4.2812, 0.2715, -0.1670, -4.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:09:40,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:09:40,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 124.02 | bwd_microstep: 1301.50 | bwd_inner_microstep: 1.59 | bwd_allreduce_microstep: 1299.81 | step_microstep: 1.88 [2025-11-06 19:09:40,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 259.18 | bwd: 1302.58 | bwd_inner: 2.58 | bwd_allreduce: 1299.85 | step: 1.96 97%|█████████▋| 3415/3507 [1:24:54<02:37, 1.71s/it] {'loss': 0.2637, 'learning_rate': 3.6088659373019195e-08, 'epoch': 0.97} 97%|█████████▋| 3415/3507 [1:24:54<02:37, 1.71s/it]tensor([[-4.2188, -3.6094, 0.0986, 2.9375, -1.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6094, -4.0312, -0.8711, 3.5781, -0.8672]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0312, -2.1562, 1.1484, 1.0625, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:41,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
181.00 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.9062, -4.3125, -0.0120, 3.0938, -2.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-9.4375, -5.9688, 0.2041, -1.1328, -7.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7812, -6.4062, -4.8125, -0.8672, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.9297, -3.0312, -2.6250, 1.4922, 0.3965]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-7.9062, -5.2812, 1.1406, 1.1094, -5.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:41,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 19:09:41,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.90 | bwd_microstep: 2.08 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 0.89 | step_microstep: 1.92 [2025-11-06 19:09:41,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 404.92 | bwd: 2.94 | bwd_inner: 1.87 | bwd_allreduce: 0.93 | step: 2.01 97%|█████████▋| 3416/3507 [1:24:55<02:01, 1.33s/it] {'loss': 0.7727, 'learning_rate': 3.530884658611733e-08, 'epoch': 0.97} 97%|█████████▋| 3416/3507 [1:24:55<02:01, 1.33s/it]tensor([[-4.9375, -5.5938, -2.6094, 2.4062, -1.6641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:41,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.63 | bwd_microstep: 1.10 | bwd_inner_microstep: 0.98 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.5625, -1.1094, 0.1377, -3.6094, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.0625, -3.6562, 2.1406, 0.1641, -5.6875]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0000, -3.6094, -1.0078, 1.6328, -1.8203]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-0.9805, 2.9219, 2.6250, -2.6875, -2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.2812, -5.3750, -0.3086, 2.9375, -3.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.4062, -0.2695, 2.7344, -2.0625, -4.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7812, -0.2129, 2.9688, -1.0234, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:09:43,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.16 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 19:09:43,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.91 | bwd_microstep: 609.94 | bwd_inner_microstep: 1.15 | bwd_allreduce_microstep: 608.69 | step_microstep: 1.69 [2025-11-06 19:09:43,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.58 | bwd: 611.03 | bwd_inner: 2.15 | bwd_allreduce: 608.73 | step: 1.78 97%|█████████▋| 3417/3507 [1:24:57<02:15, 1.50s/it] {'loss': 0.1596, 'learning_rate': 3.453753637072788e-08, 'epoch': 0.97} 97%|█████████▋| 3417/3507 [1:24:57<02:15, 1.50s/it]tensor([[-5.6562, -2.1719, 2.4688, -0.1592, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1562, -3.7344, 0.2676, 3.5938, -1.6797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:43,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.03 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-4.2500, -4.3438, -1.5000, 2.0000, -1.7656]], device='cuda:2', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7188, -0.0583, 4.1562, -1.5781, -5.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9062, -4.6875, -0.6836, 3.2188, -2.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.3438, -2.4062, 1.5391, -0.2441, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5312, -4.5312, 0.2168, 3.0312, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0625, -2.1875, 2.9375, -0.4277, -5.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:43,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:09:43,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 303.99 | bwd_microstep: 2.31 | bwd_inner_microstep: 1.50 | bwd_allreduce_microstep: 0.73 | step_microstep: 1.40 [2025-11-06 19:09:43,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 491.05 | bwd: 3.27 | bwd_inner: 2.39 | bwd_allreduce: 0.76 | step: 1.47 97%|█████████▋| 3418/3507 [1:24:57<01:47, 1.21s/it] {'loss': 0.062, 'learning_rate': 3.3774729384986605e-08, 'epoch': 0.97} 97%|█████████▋| 3418/3507 [1:24:57<01:47, 1.21s/it]tensor([[-5.4062, -1.7656, 2.8281, -0.0586, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.5312, 0.2793, 3.5938, -2.7656, -5.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:44,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.18 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.6094, -2.8125, 0.8672, 3.5938, -1.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') tensor([[-7.6562, -4.5312, 1.2812, -0.0786, -5.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6875, -2.5781, 1.5234, 1.7500, -3.1406]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8281, -2.5469, 0.2793, 3.0938, -0.8164]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0312, -2.9219, 0.8516, 0.9648, -3.4688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5625, -2.0000, 1.4766, -1.7109, -5.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:45,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.55 | optimizer_gradients: 0.16 | optimizer_step: 0.17 [2025-11-06 19:09:45,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.79 | bwd_microstep: 1.95 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 0.80 | step_microstep: 2.06 [2025-11-06 19:09:45,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.00 | bwd: 2.79 | bwd_inner: 1.81 | bwd_allreduce: 0.84 | step: 2.14 97%|█████████▋| 3419/3507 [1:24:59<02:10, 1.49s/it] {'loss': 0.606, 'learning_rate': 3.3020426279773974e-08, 'epoch': 0.97} 97%|█████████▋| 3419/3507 [1:24:59<02:10, 1.49s/it]tensor([[-5.8438, -4.5938, 0.3027, 2.6406, -3.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0625, -3.8906, -0.2422, 3.4688, -1.4297]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7812, -4.0625, -1.5234, 2.0938, -1.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:46,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.95 | bwd_microstep: 0.80 | bwd_inner_microstep: 0.69 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.07 
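The DeepSpeed `log_dist` timer lines above break each optimizer step into `fwd` / `bwd` / `bwd_inner` / `bwd_allreduce` / `step` times in milliseconds. In this run, `bwd_allreduce` intermittently jumps from under 1 ms to over 1000 ms (e.g. 1299.85 ms at step 3415 and 1488.50 ms at step 3421), which usually indicates rank imbalance: fast ranks wait in the gradient all-reduce for a straggler. A minimal sketch for pulling these timings out of the log and flagging allreduce-dominated steps (the 100 ms threshold is an arbitrary assumption):

```python
import re

# Captures the "a: 1.0 | b: 2.0" tail of a "time (ms) | ..." log line.
TIMER_RE = re.compile(r"time \(ms\)\s*\|\s*(.+)$")

def parse_timers(line):
    """Return {name: ms} for a DeepSpeed timer log line, or None if absent."""
    m = TIMER_RE.search(line)
    if not m:
        return None
    timings = {}
    for field in m.group(1).split("|"):
        name, _, value = field.partition(":")
        timings[name.strip()] = float(value)
    return timings

def allreduce_stalls(lines, threshold_ms=100.0):
    """Collect timer dicts where bwd_allreduce exceeds threshold_ms."""
    return [t for t in map(parse_timers, lines)
            if t and t.get("bwd_allreduce", 0.0) > threshold_ms]
```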
tensor([[-4.7812, -2.5312, 1.1719, 0.8945, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2500, -1.0703, 2.3125, 1.8594, -2.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.3125, 1.8984, 2.9844, -2.4062, -3.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.9688, -4.5000, -0.7852, 2.5625, -2.2969]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.9688, -3.2188, 1.4844, 0.4062, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:46,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.15 | optimizer_step: 0.18 [2025-11-06 19:09:46,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.56 | bwd_microstep: 1.77 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.70 | step_microstep: 2.51 [2025-11-06 19:09:46,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 368.52 | bwd: 2.56 | bwd_inner: 1.69 | bwd_allreduce: 0.74 | step: 2.58 98%|█████████▊| 3420/3507 [1:25:00<01:41, 1.16s/it] {'loss': 0.5132, 'learning_rate': 3.227462769871404e-08, 'epoch': 0.98} 98%|█████████▊| 3420/3507 [1:25:00<01:41, 1.16s/it]tensor([[-1.5703, 2.2344, 2.5469, -2.3125, -2.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 19:09:46,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.00 | bwd_microstep: 1.01 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.9531, -1.5625, 1.9609, 0.9844, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.3438, -2.1406, 1.6328, 1.4141, -3.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.5625, -5.5938, -0.7930, 
1.6875, -3.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.6250, -0.4668, 2.3750, -0.4785, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1250, -5.5938, -0.7383, 2.9375, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.7188, -0.6953, 3.7812, -0.3613, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2812, -5.0938, -1.2344, 0.6797, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:09:49,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.24 | optimizer_step: 0.30 [2025-11-06 19:09:49,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.37 | bwd_microstep: 1489.58 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 1488.44 | step_microstep: 334.20 [2025-11-06 19:09:49,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 303.38 | bwd: 1490.59 | bwd_inner: 1.97 | bwd_allreduce: 1488.50 | step: 334.29 98%|█████████▊| 3421/3507 [1:25:03<02:29, 1.74s/it] {'loss': 0.3075, 'learning_rate': 3.153733427817329e-08, 'epoch': 0.98} 98%|█████████▊| 3421/3507 [1:25:03<02:29, 1.74s/it]tensor([[-5.1250, -3.2812, 1.1328, 2.0469, -3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:49,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.84 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.9219, -2.9531, -0.1025, 3.4062, -0.6484]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.4062, -2.6094, 1.5859, 2.4844, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.5625, -2.1406, 1.3906, 4.4688, -0.4844]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.0000, -5.3750, 0.5000, 2.4531, -4.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.9688, -2.7969, 1.9609, -0.1533, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6719, -1.5156, 2.9688, 2.8594, -2.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.6562, -1.4453, 3.8438, -0.4531, -5.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:09:49,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:09:49,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.84 | bwd_microstep: 138.43 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 137.27 | step_microstep: 1.60 [2025-11-06 19:09:49,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 309.69 | bwd: 139.30 | bwd_inner: 1.87 | bwd_allreduce: 137.30 | step: 1.68 98%|█████████▊| 3422/3507 [1:25:03<01:55, 1.36s/it] {'loss': 0.8488, 'learning_rate': 3.080854664726296e-08, 'epoch': 0.98} 98%|█████████▊| 3422/3507 [1:25:03<01:55, 1.36s/it]tensor([[-4.9688, -0.9375, 3.3594, -0.8086, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4062, -3.8750, -0.8320, -1.9609, -5.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5938, -2.6094, 1.3438, 1.5391, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:50,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.37 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.7812, 0.8594, 2.9688, -1.4688, -3.3438]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.8125, 0.2119, 3.1875, -1.8516, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-8.5625, -5.4062, -0.4844, -2.4219, -6.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.2500, -0.9766, 2.3438, -0.1973, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.9375, 0.4492, 3.5938, -1.9062, -4.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:09:51,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.34 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:09:51,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.47 | bwd_microstep: 361.44 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 360.26 | step_microstep: 1.92 [2025-11-06 19:09:51,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.86 | bwd: 362.30 | bwd_inner: 1.86 | bwd_allreduce: 360.29 | step: 2.00 98%|█████████▊| 3423/3507 [1:25:05<02:02, 1.45s/it] {'loss': 0.5545, 'learning_rate': 3.008826542783561e-08, 'epoch': 0.98} 98%|█████████▊| 3423/3507 [1:25:05<02:02, 1.45s/it]tensor([[-4.9062, -3.6719, -0.0747, 1.7344, -2.7656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.4375, -1.5547, 1.9062, 0.1660, -3.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2188, 1.3203, 3.3438, -2.7500, -4.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-5.0625, -4.5938, -0.9531, 2.1406, -2.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:51,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.09 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.75 | 
bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-1.4375, 1.8594, 2.5312, -1.7031, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1562, -2.3281, 2.3281, 1.2578, -3.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.9688, 0.2754, 1.6328, -1.5703, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-5.1875, -0.8672, 2.3281, -2.6562, -5.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:52,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 19:09:52,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 328.95 | bwd_microstep: 1.98 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.88 | step_microstep: 3.06 [2025-11-06 19:09:52,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 526.06 | bwd: 2.84 | bwd_inner: 1.80 | bwd_allreduce: 0.91 | step: 3.14 98%|█████████▊| 3424/3507 [1:25:06<01:39, 1.20s/it] {'loss': 0.6243, 'learning_rate': 2.937649123448627e-08, 'epoch': 0.98} 98%|█████████▊| 3424/3507 [1:25:06<01:39, 1.20s/it]tensor([[-3.2812, -4.0000, -2.5000, 1.6797, -0.6602]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -1.4922, 2.1719, -1.6797, -4.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.4375, -5.4375, 0.9688, 2.3281, -4.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:52,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.26 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-7.5000, -4.8750, 1.3906, 1.1953, -5.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], 
device='cuda:1') tensor([[-4.4688, -5.3438, -3.0000, 1.9141, -1.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.7500, -0.4492, 3.5938, -1.6406, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5938, -0.9805, 3.8281, -1.5156, -5.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.9375, -2.7656, 1.8438, -0.2471, -4.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:09:54,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.79 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 19:09:54,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.98 | bwd_microstep: 589.07 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 588.03 | step_microstep: 2.63 [2025-11-06 19:09:54,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 535.27 | bwd: 590.05 | bwd_inner: 1.83 | bwd_allreduce: 588.07 | step: 2.71 98%|█████████▊| 3425/3507 [1:25:08<02:15, 1.65s/it] {'loss': 0.1504, 'learning_rate': 2.8673224674548028e-08, 'epoch': 0.98} 98%|█████████▊| 3425/3507 [1:25:08<02:15, 1.65s/it]tensor([[-3.9844, -4.5625, -2.2500, 2.0469, -1.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9688, -2.9375, 1.7422, 2.0938, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:09:55,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.04 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.10 tensor([[-4.1250, -4.7188, -2.3906, 1.9453, -1.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-5.8750, -2.9375, 1.8750, 0.3711, -4.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') 
tensor([[-2.9062, 1.3359, 2.5000, -3.1406, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3750, -1.3984, 3.1094, -0.7227, -5.0312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[ 0.3594, 3.8594, 3.4375, -1.1484, -0.9961]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-2.8438, 0.1953, 2.1562, -0.4219, -2.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:09:55,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.27 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 19:09:55,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.02 | bwd_microstep: 25.55 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 24.38 | step_microstep: 3.19 [2025-11-06 19:09:55,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 363.09 | bwd: 26.46 | bwd_inner: 1.87 | bwd_allreduce: 24.43 | step: 3.29 98%|█████████▊| 3426/3507 [1:25:09<01:44, 1.29s/it] {'loss': 0.6896, 'learning_rate': 2.7978466348100863e-08, 'epoch': 0.98} 98%|█████████▊| 3426/3507 [1:25:09<01:44, 1.29s/it]tensor([[-2.0938, -3.0625, -2.0469, 2.3125, 0.3828]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-4.2500, -4.4375, -1.1953, 3.1875, -1.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:55,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.82 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.08 tensor([[-4.9375, -4.1562, 0.1089, 2.7031, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.1875, -4.7812, -1.9688, 2.9375, -1.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.5625, -6.5000, 
-1.0391, 2.0000, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.0938, -5.9375, -0.0072, 3.0469, -3.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [19:09:57] /github/workspace/src/video/video_reader.cc:83: ERROR opening: /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch3/ExtremeSportsPovStockFootage-Adventurevideory.com.mp4, No such file or directory Failed to load video: /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch23/Worth_Knowing_-_videos_of_America_Palos_Park_Worth.mp4, the dataset is: sharegpt4v_instruct_gpt4-vision_cap100k Error reading /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch3/ExtremeSportsPovStockFootage-Adventurevideory.com.mp4... sharegpt4v_instruct_gpt4-vision_cap100k Traceback (most recent call last): File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 718, in __getitem__ ret=self.video_get_item(data_item) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 604, in video_get_item image_list,frame_indices = self.load_video(video_path) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/shared-storage-user/jiaziheng/pretrain/internvl-sft/internvl_clip_linear_probe/internvl/train/internvl_chat_finetune_dist.py", line 582, in load_video vr = VideoReader(video_path, ctx=cpu(0), num_threads=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/shared-storage-user/jiaziheng/miniconda3/envs/visualquality/lib/python3.11/site-packages/decord/video_reader.py", line 57, in __init__ raise RuntimeError("Error reading " + uri + "...") RuntimeError: Error reading 
/mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch3/ExtremeSportsPovStockFootage-Adventurevideory.com.mp4... tensor([[-5.3125, -3.9688, 0.3477, 2.0625, -3.1094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5938, -4.1562, 1.6797, 1.6953, -4.5938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:09:57,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.15 | optimizer_step: 0.17 [2025-11-06 19:09:57,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.10 | bwd_microstep: 1834.73 | bwd_inner_microstep: 1.16 | bwd_allreduce_microstep: 1833.48 | step_microstep: 1.78 [2025-11-06 19:09:57,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 336.94 | bwd: 1835.65 | bwd_inner: 1.98 | bwd_allreduce: 1833.53 | step: 1.86 98%|█████████▊| 3427/3507 [1:25:11<02:05, 1.56s/it] {'loss': 0.397, 'learning_rate': 2.7292216847957242e-08, 'epoch': 0.98} 98%|█████████▊| 3427/3507 [1:25:11<02:05, 1.56s/it]tensor([[-5.0625, -5.0000, -2.3125, 1.3281, -2.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:57,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 123.47 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-6.5312, -4.9688, -0.8867, 0.5938, -4.1875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.4531, -3.3125, -0.9336, 2.2656, -1.1797]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.0000, -0.3867, 1.8594, -2.0938, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.0625, -1.4844, 2.0312, 1.0078, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.0625, 
-3.1094, 0.4590, 2.5469, -2.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.9531, 1.6484, 4.0625, -2.4219, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5938, -3.8438, 0.9258, 2.1250, -3.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:09:57,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.39 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:09:57,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.66 | bwd_microstep: 52.13 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 50.93 | step_microstep: 1.98 [2025-11-06 19:09:57,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 314.15 | bwd: 53.08 | bwd_inner: 1.98 | bwd_allreduce: 50.98 | step: 2.08 98%|█████████▊| 3428/3507 [1:25:11<01:35, 1.22s/it] {'loss': 0.1611, 'learning_rate': 2.6614476759676546e-08, 'epoch': 0.98} 98%|█████████▊| 3428/3507 [1:25:11<01:35, 1.22s/it]tensor([[-3.4844, -3.9375, -1.3281, 2.8281, -0.9102]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:09:58,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.32 | bwd_microstep: 0.98 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.9688, -6.1562, 0.2852, 2.0781, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7812, -4.4688, 0.4160, 2.5000, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5469, -3.8594, -1.1641, 2.8906, -0.9570]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0312, -2.1406, 0.2148, -4.1250, -5.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') tensor([[-3.5469, -3.0312, 0.4355, 3.3438, 
-1.3672]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7188, -2.2969, 2.5312, 0.1328, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.6250, -4.3438, -0.6211, 2.8906, -1.9922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:10:01,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.16 | optimizer_step: 0.25 [2025-11-06 19:10:01,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.64 | bwd_microstep: 1095.92 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 1094.64 | step_microstep: 1.86 [2025-11-06 19:10:01,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 350.97 | bwd: 1096.90 | bwd_inner: 2.08 | bwd_allreduce: 1094.68 | step: 1.94 98%|█████████▊| 3429/3507 [1:25:14<02:21, 1.82s/it] {'loss': 0.378, 'learning_rate': 2.5945246661551738e-08, 'epoch': 0.98} 98%|█████████▊| 3429/3507 [1:25:15<02:21, 1.82s/it]tensor([[-6.5000, -5.3125, 0.1006, 2.7031, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([[-6.1250, -1.9922, 3.3281, -0.6602, -5.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([3], device='cuda:1') tensor([[-3.2344, -2.6562, 1.4766, 4.4688, -1.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:01,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.24 | bwd_microstep: 1.35 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.18 tensor([[-5.0312, -2.0469, 2.7031, 0.7500, -4.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9844, -0.4766, 2.6719, -1.1719, -4.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5938, -4.4688, -1.3438, 2.1719, -2.0000]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5000, -1.4922, 2.0312, -0.0157, -3.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.5625, -5.7188, 0.7539, 2.4688, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:10:01,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 19:10:01,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 145.62 | bwd_microstep: 1.92 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 0.73 | step_microstep: 1.65 [2025-11-06 19:10:01,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 401.91 | bwd: 3.29 | bwd_inner: 2.25 | bwd_allreduce: 0.84 | step: 1.86 98%|█████████▊| 3430/3507 [1:25:15<01:48, 1.41s/it] {'loss': 1.1266, 'learning_rate': 2.5284527124618262e-08, 'epoch': 0.98} 98%|█████████▊| 3430/3507 [1:25:15<01:48, 1.41s/it]tensor([[-5.6562, -4.4375, -0.3535, 1.3828, -3.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:01,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 130.06 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-2.2344, 1.2188, 2.7969, -1.2109, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.6250, -5.5625, -1.8281, 0.1025, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.0625, -4.1562, 0.2285, 2.8125, -2.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.4375, 1.1250, 3.6719, -2.5469, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[ 0.2930, 2.5781, 3.9844, 1.7812, -0.1250]], device='cuda:1', 
[training log excerpt, fine-tuning steps 3431–3450 of 3507 (epoch 0.98); interleaved per-sample logit/label debug prints (bfloat16 tensors on cuda:0–3, `grad_fn` reprs truncated by extraction), duplicated tqdm progress lines, per-step DeepSpeed `log_dist` timing breakdowns, and an ffmpeg "[h264] mmco: unref short failure" warning removed; only the per-step metrics lines are kept below]
 98%|█████████▊| 3431/3507 [1:25:18<02:28, 1.96s/it] {'loss': 0.142, 'learning_rate': 2.4632318712646264e-08, 'epoch': 0.98}
 98%|█████████▊| 3432/3507 [1:25:19<01:55, 1.54s/it] {'loss': 0.2314, 'learning_rate': 2.3988621982148353e-08, 'epoch': 0.98}
 98%|█████████▊| 3433/3507 [1:25:20<01:48, 1.47s/it] {'loss': 0.6081, 'learning_rate': 2.3353437482369624e-08, 'epoch': 0.98}
 98%|█████████▊| 3434/3507 [1:25:21<01:25, 1.18s/it] {'loss': 0.5439, 'learning_rate': 2.272676575529431e-08, 'epoch': 0.98}
 98%|█████████▊| 3435/3507 [1:25:23<01:57, 1.63s/it] {'loss': 0.148, 'learning_rate': 2.2108607335642463e-08, 'epoch': 0.98}
 98%|█████████▊| 3436/3507 [1:25:24<01:29, 1.27s/it] {'loss': 0.4668, 'learning_rate': 2.1498962750869933e-08, 'epoch': 0.98}
 98%|█████████▊| 3437/3507 [1:25:25<01:38, 1.40s/it] {'loss': 0.598, 'learning_rate': 2.0897832521169505e-08, 'epoch': 0.98}
 98%|█████████▊| 3438/3507 [1:25:26<01:20, 1.16s/it] {'loss': 0.4789, 'learning_rate': 2.0305217159466428e-08, 'epoch': 0.98}
 98%|█████████▊| 3439/3507 [1:25:28<01:43, 1.53s/it] {'loss': 0.8924, 'learning_rate': 1.9721117171420668e-08, 'epoch': 0.98}
 98%|█████████▊| 3440/3507 [1:25:29<01:20, 1.20s/it] {'loss': 0.3555, 'learning_rate': 1.91455330554291e-08, 'epoch': 0.98}
 98%|█████████▊| 3441/3507 [1:25:31<01:36, 1.47s/it] {'loss': 0.2468, 'learning_rate': 1.8578465302618864e-08, 'epoch': 0.98}
 98%|█████████▊| 3442/3507 [1:25:32<01:30, 1.39s/it] {'loss': 0.1638, 'learning_rate': 1.801991439685291e-08, 'epoch': 0.98}
 98%|█████████▊| 3443/3507 [1:25:38<03:02, 2.85s/it] {'loss': 0.4189, 'learning_rate': 1.746988081472556e-08, 'epoch': 0.98}
 98%|█████████▊| 3444/3507 [1:25:39<02:13, 2.12s/it] {'loss': 0.8092, 'learning_rate': 1.692836502556472e-08, 'epoch': 0.98}
 98%|█████████▊| 3445/3507 [1:25:39<01:39, 1.61s/it] {'loss': 0.5936, 'learning_rate': 1.639536749142745e-08, 'epoch': 0.98}
 98%|█████████▊| 3446/3507 [1:25:40<01:16, 1.26s/it] {'loss': 0.6812, 'learning_rate': 1.587088866710551e-08, 'epoch': 0.98}
 98%|█████████▊| 3447/3507 [1:25:40<01:01, 1.03s/it] {'loss': 0.1604, 'learning_rate': 1.535492900012203e-08, 'epoch': 0.98}
 98%|█████████▊| 3448/3507 [1:25:43<01:35, 1.62s/it] {'loss': 0.3876, 'learning_rate': 1.4847488930728182e-08, 'epoch': 0.98}
 98%|█████████▊| 3449/3507 [1:25:44<01:23, 1.43s/it] {'loss': 0.4053, 'learning_rate': 1.4348568891908721e-08, 'epoch': 0.98}
 98%|█████████▊| 3450/3507 [1:25:46<01:31, 1.61s/it] {'loss': 0.191, 'learning_rate': 1.3858169309376446e-08, 'epoch': 0.98}
optimizer_allgather: 0.35 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 19:10:33,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.66 | bwd_microstep: 414.76 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 413.92 | step_microstep: 1.95 [2025-11-06 19:10:33,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 502.81 | bwd: 415.43 | bwd_inner: 1.33 | bwd_allreduce: 413.97 | step: 2.03 98%|█████████▊| 3451/3507 [1:25:47<01:19, 1.42s/it] {'loss': 0.388, 'learning_rate': 1.3376290601574416e-08, 'epoch': 0.98} 98%|█████████▊| 3451/3507 [1:25:47<01:19, 1.42s/it]tensor([[-5.1875, -3.5469, 1.0078, 2.1406, -3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.6562, -4.7188, 0.9844, 1.8203, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.4375, -5.5625, 1.0312, 0.6836, -6.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.1719, 1.5781, 3.0469, -1.7422, -3.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.8750, -5.0312, 0.9102, 2.4531, -4.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.7188, -6.7812, -2.0000, 1.0625, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:35,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.89 | bwd_microstep: 10.06 | bwd_inner_microstep: 9.93 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.11 tensor([[-3.9844, -3.6719, -0.2656, 2.7656, -1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.3750, -5.2188, -0.8984, 3.3906, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:10:36,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | 
optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 19:10:36,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.28 | bwd_microstep: 134.28 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 133.38 | step_microstep: 1.99 [2025-11-06 19:10:36,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 356.19 | bwd: 144.34 | bwd_inner: 10.77 | bwd_allreduce: 133.43 | step: 2.10 98%|█████████▊| 3452/3507 [1:25:50<01:35, 1.74s/it] {'loss': 0.2158, 'learning_rate': 1.2902933179675947e-08, 'epoch': 0.98} 98%|█████████▊| 3452/3507 [1:25:50<01:35, 1.74s/it]tensor([[-1.3281, 1.7812, 3.9219, 0.9219, -1.6641]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.8750, -5.1875, -0.5039, 2.6250, -3.0781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([[-4.6562, -1.8828, 1.5078, -0.4492, -3.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)tensor([3], device='cuda:0') tensor([2], device='cuda:1') [2025-11-06 19:10:36,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.00 | bwd_microstep: 0.72 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-7.5312, -4.4688, 1.5391, 0.3926, -5.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4062, -0.5586, 4.4688, 3.2188, -2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.6250, -0.0586, 4.0625, -1.4219, -5.0312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0312, -1.7734, 3.0938, -1.3047, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.3125, -5.5000, -0.7148, 2.3125, -3.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:10:38,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.24 | optimizer_gradients: 0.18 | 
optimizer_step: 0.18 [2025-11-06 19:10:38,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 148.89 | bwd_microstep: 2284.93 | bwd_inner_microstep: 1.18 | bwd_allreduce_microstep: 2283.66 | step_microstep: 1.79 [2025-11-06 19:10:38,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 323.91 | bwd: 2285.64 | bwd_inner: 1.78 | bwd_allreduce: 2283.71 | step: 1.88 98%|█████████▊| 3453/3507 [1:25:52<01:48, 2.02s/it] {'loss': 0.83, 'learning_rate': 1.2438097447581288e-08, 'epoch': 0.98} 98%|█████████▊| 3453/3507 [1:25:52<01:48, 2.02s/it]tensor([[-5.5938, -4.9062, -0.9727, 2.0625, -2.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1875, -4.1562, -0.8281, 2.8438, -1.6328]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.3906, 1.2031, 2.6719, -1.4609, -2.9531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8125, -3.4375, 0.4824, 1.6875, -2.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:10:39,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 128.90 | bwd_microstep: 2.98 | bwd_inner_microstep: 2.85 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-4.4375, -3.9844, 0.1797, 3.6250, -1.8359]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.3594, 0.0781, 3.0000, -0.7148, -3.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.0000, -5.6250, -0.5703, 1.6562, -4.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.3438, 0.4473, 3.9062, -0.4902, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:10:40,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.83 | optimizer_gradients: 0.19 | optimizer_step: 0.18 
[2025-11-06 19:10:40,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.20 | bwd_microstep: 909.50 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 908.48 | step_microstep: 2.85 [2025-11-06 19:10:40,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 281.11 | bwd: 912.48 | bwd_inner: 3.78 | bwd_allreduce: 908.54 | step: 2.95 98%|█████████▊| 3454/3507 [1:25:54<01:44, 1.98s/it] {'loss': 0.2578, 'learning_rate': 1.1981783801923163e-08, 'epoch': 0.98} 98%|█████████▊| 3454/3507 [1:25:54<01:44, 1.98s/it]tensor([[-3.1562, 0.6992, 3.1406, -1.4844, -3.6875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.1406, 1.9922, 1.7578, -2.0312, -1.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.2812, -2.2344, 1.7031, 1.9766, -2.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:41,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.60 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.86 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-7.7812, -4.1250, 2.1094, -0.3789, -6.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[0.8008, 0.4785, 2.1562, 5.8438, 2.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.8125, -0.4551, 3.6562, -1.0859, -4.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.0625, -0.4570, 2.7344, -0.7695, -3.9844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7344, -1.3828, 1.9688, 1.1328, -2.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:10:41,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 19:10:41,647] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.32 | bwd_microstep: 413.09 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 411.97 | step_microstep: 1.94 [2025-11-06 19:10:41,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.95 | bwd: 414.05 | bwd_inner: 1.90 | bwd_allreduce: 412.01 | step: 2.03 99%|█████████▊| 3455/3507 [1:25:55<01:24, 1.63s/it] {'loss': 0.2417, 'learning_rate': 1.153399263205901e-08, 'epoch': 0.99} 99%|█████████▊| 3455/3507 [1:25:55<01:24, 1.63s/it]tensor([[-3.0781, 1.1172, 3.1562, -2.1562, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9219, -3.1562, -0.1465, 1.6875, -2.0781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.3750, -3.3125, 0.0503, -3.9688, -6.7500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.6562, -4.7188, -1.2656, 0.7305, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.6562, -4.4062, 1.2344, 1.5938, -4.5625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.5156, -3.9844, -1.5938, 2.4531, -0.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.7500, -6.7500, -1.2656, 1.8203, -4.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:43,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.10 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.2500, -5.5000, -1.7656, 2.7344, -2.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:46,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 0.12 | optimizer_step: 0.18 [2025-11-06 19:10:46,832] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.30 | bwd_microstep: 1.97 | bwd_inner_microstep: 1.03 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.69 [2025-11-06 19:10:46,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 439.42 | bwd: 2.95 | bwd_inner: 1.92 | bwd_allreduce: 0.91 | step: 2.78 99%|█████████▊| 3456/3507 [1:26:00<02:17, 2.70s/it] {'loss': 0.1445, 'learning_rate': 1.1094724320074301e-08, 'epoch': 0.99} 99%|█████████▊| 3456/3507 [1:26:00<02:17, 2.70s/it]tensor([[-1.9531, 1.5859, 3.0469, -1.2422, -2.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.4062, -4.5312, -0.4570, 3.9844, -1.4844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.8906, 1.2266, 3.5938, -1.9844, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1875, -4.2812, -0.8594, 3.1562, -1.4922]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.1406, 2.8281, 3.0156, -2.2188, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5938, -5.5625, -0.6914, 1.7969, -3.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:47,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 375.80 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.3750, -4.1875, 0.3203, 2.2344, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.2188, -1.3359, 3.3438, -0.3125, -4.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:10:47,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.18 | optimizer_step: 0.18 [2025-11-06 19:10:47,449] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 190.91 | bwd_microstep: 2.11 | bwd_inner_microstep: 1.12 | bwd_allreduce_microstep: 0.89 | step_microstep: 2.36 [2025-11-06 19:10:47,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 566.75 | bwd: 3.05 | bwd_inner: 1.98 | bwd_allreduce: 0.92 | step: 2.44 99%|█████████▊| 3457/3507 [1:26:01<01:43, 2.07s/it] {'loss': 0.3383, 'learning_rate': 1.0663979240784772e-08, 'epoch': 0.99} 99%|█████████▊| 3457/3507 [1:26:01<01:43, 2.07s/it]tensor([[-6.6562, -4.0312, 1.8906, 1.6641, -4.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.4375, -4.2812, -0.7891, 2.6719, -1.8516]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.2188, 1.9062, 1.9297, -1.7266, -1.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 19:10:47,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 138.59 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-7.5625, -4.7500, 0.0593, -1.3125, -5.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0156, -2.2656, 1.8203, 4.5625, -0.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-7.3438, -6.7812, -2.5312, 0.7812, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.9688, -3.4219, 2.0781, -0.5820, -5.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.3438, -4.5312, 0.9414, 2.1250, -4.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:10:47,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 19:10:47,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.04 | 
bwd_microstep: 171.15 | bwd_inner_microstep: 0.90 | bwd_allreduce_microstep: 170.19 | step_microstep: 1.71 [2025-11-06 19:10:47,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 305.65 | bwd: 171.89 | bwd_inner: 1.55 | bwd_allreduce: 170.22 | step: 1.78 99%|█████████▊| 3458/3507 [1:26:01<01:18, 1.60s/it] {'loss': 0.435, 'learning_rate': 1.0241757761733084e-08, 'epoch': 0.99} 99%|█████████▊| 3458/3507 [1:26:01<01:18, 1.60s/it]tensor([[-7.9688, -4.9688, 1.0938, 0.3184, -5.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:10:48,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 126.44 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.5938, -1.6250, 2.2188, 0.7695, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.2500, -0.6914, 4.0625, 1.3281, -3.7344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5625, -3.5312, 0.1050, 1.8906, -2.5469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7812, -2.5469, 2.1406, 0.0737, -4.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4062, -1.3672, 3.4062, -0.5898, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.1562, -5.9062, -1.0469, 1.2578, -4.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9531, 0.8438, 4.4062, -1.8203, -4.7812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:10:48,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.20 | optimizer_step: 0.20 [2025-11-06 19:10:48,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 234.84 | bwd_microstep: 125.59 | 
bwd_inner_microstep: 1.01 | bwd_allreduce_microstep: 124.50 | step_microstep: 1.91 [2025-11-06 19:10:48,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 361.29 | bwd: 126.42 | bwd_inner: 1.76 | bwd_allreduce: 124.54 | step: 2.00 99%|█████████▊| 3459/3507 [1:26:02<01:01, 1.28s/it] {'loss': 0.4017, 'learning_rate': 9.82806024318661e-09, 'epoch': 0.99} 99%|█████████▊| 3459/3507 [1:26:02<01:01, 1.28s/it]tensor([[-3.6406, -1.1328, 2.0938, 1.0000, -2.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -3.0938, 0.8984, 1.2891, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3438, -4.7500, -1.1406, 3.4688, -1.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.8281, -4.2812, -1.4531, 2.7969, -1.0859]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5625, -4.7188, -0.1108, 2.7031, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.7812, -3.4844, 0.4473, 4.0625, -1.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:48,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.40 | bwd_microstep: 0.93 | bwd_inner_microstep: 0.81 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-7.0938, -5.0312, 1.1094, 2.0469, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0000, -2.2656, 1.9766, 0.7188, -3.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:10:50,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.17 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 19:10:50,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.36 | bwd_microstep: 1.93 | bwd_inner_microstep: 1.03 | 
bwd_allreduce_microstep: 0.84 | step_microstep: 2.94 [2025-11-06 19:10:50,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 382.77 | bwd: 2.86 | bwd_inner: 1.86 | bwd_allreduce: 0.87 | step: 3.03 99%|█████████▊| 3460/3507 [1:26:04<01:11, 1.52s/it] {'loss': 0.4539, 'learning_rate': 9.42288703814187e-09, 'epoch': 0.99} 99%|█████████▊| 3460/3507 [1:26:04<01:11, 1.52s/it]tensor([[-6.0938, -3.0000, 2.1406, 0.4512, -4.8438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6250, -4.6250, -0.2031, 2.2031, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:50,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.72 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.6562, -5.3125, -1.6641, 1.3828, -2.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.9219, -4.5312, -1.7109, 3.1094, -1.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.6250, -4.3125, 1.8672, 2.3906, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-7.0312, -3.5469, 2.4844, 0.1963, -5.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-8.4375, -6.1562, -0.1104, 0.5977, -5.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6250, -1.5312, 2.3125, -0.0420, -3.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:10:51,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.40 | optimizer_gradients: 0.19 | optimizer_step: 0.18 [2025-11-06 19:10:51,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.04 | bwd_microstep: 560.33 | bwd_inner_microstep: 0.93 | bwd_allreduce_microstep: 559.31 | 
step_microstep: 1.99 [2025-11-06 19:10:51,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 395.79 | bwd: 560.99 | bwd_inner: 1.50 | bwd_allreduce: 559.35 | step: 2.07 99%|█████████▊| 3461/3507 [1:26:05<01:02, 1.36s/it] {'loss': 0.4565, 'learning_rate': 9.026238492321204e-09, 'epoch': 0.99} 99%|█████████▊| 3461/3507 [1:26:05<01:02, 1.36s/it]tensor([[-3.7344, -0.1270, 3.0469, -0.8867, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.4844, -3.7188, -1.3828, 2.5000, -0.9961]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5000e+00, -5.2812e+00, 1.2054e-03, 2.4219e+00, -3.8125e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:51,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.58 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-6.4062, -2.3750, 2.7969, -0.8633, -5.7500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.2344, 2.0000, 3.3750, -2.3906, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-6.2188, -5.7500, -1.5156, 1.9141, -3.2969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.7500, -2.5000, 1.5156, 1.1094, -3.4219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0000, -2.6250, 2.5312, 2.6094, -3.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:53,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:10:53,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.85 | bwd_microstep: 2.09 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.88 | 
step_microstep: 2.41 [2025-11-06 19:10:53,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 313.46 | bwd: 2.84 | bwd_inner: 1.81 | bwd_allreduce: 0.90 | step: 2.49 99%|█████████▊| 3462/3507 [1:26:07<01:06, 1.47s/it] {'loss': 0.3836, 'learning_rate': 8.638114944171661e-09, 'epoch': 0.99} 99%|█████████▊| 3462/3507 [1:26:07<01:06, 1.47s/it]tensor([[-4.4062, -3.6406, -0.3262, 1.8281, -2.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5000, -0.3066, 3.8750, -0.8945, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.0781, -3.2500, -0.7812, 2.8438, -0.7578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:53,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.11 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.6875, -3.0781, 1.3359, 2.4375, -2.8438]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1250, -2.8281, 1.3203, 3.0469, -2.2031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6250, -4.0938, 0.7461, 2.4844, -3.3281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9375, -3.4062, 1.0938, 2.5469, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.5000, -4.3125, 0.5430, 0.7227, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:54,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | optimizer_gradients: 0.18 | optimizer_step: 0.19 [2025-11-06 19:10:54,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.25 | bwd_microstep: 2.02 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.84 | step_microstep: 2.69 [2025-11-06 
19:10:54,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 415.40 | bwd: 2.68 | bwd_inner: 1.65 | bwd_allreduce: 0.88 | step: 2.77 99%|█████████▊| 3463/3507 [1:26:07<00:55, 1.27s/it] {'loss': 0.2089, 'learning_rate': 8.258516724868326e-09, 'epoch': 0.99} 99%|█████████▊| 3463/3507 [1:26:07<00:55, 1.27s/it]tensor([[-4.7188, -3.0625, 1.0391, 2.1250, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0312, -1.9453, 1.2656, -0.5820, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5312, -4.3125, 0.2041, 2.2812, -3.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:54,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.83 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.83 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-0.9297, 2.9219, 2.2031, -2.8750, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-5.6875, -4.6875, 0.1523, 2.7188, -3.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5312, -4.7188, -1.8516, 1.9453, -1.8359]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.8125, -4.2500, -0.5547, 2.2656, -2.4062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.9219, -3.2656, -2.4844, 2.2969, 0.5586]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') [2025-11-06 19:10:56,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.79 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 19:10:56,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.59 | bwd_microstep: 2.07 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.84 | step_microstep: 3.01 [2025-11-06 19:10:56,567] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 345.44 | bwd: 3.01 | bwd_inner: 1.97 | bwd_allreduce: 0.88 | step: 3.10 99%|█████████▉| 3464/3507 [1:26:10<01:10, 1.64s/it] {'loss': 0.3926, 'learning_rate': 7.887444158310998e-09, 'epoch': 0.99} 99%|█████████▉| 3464/3507 [1:26:10<01:10, 1.64s/it]tensor([[-3.8281, -3.2969, -0.3184, 2.3906, -1.6484]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.1250, -2.8438, 1.7812, 1.6562, -3.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:10:56,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.89 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.58 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.3125, -3.9375, 0.5312, 2.4375, -3.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.0312, -3.6094, 1.3672, 1.1406, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.3438, -3.3750, 1.3203, 2.2031, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.0547, 2.8438, 3.4219, -1.9219, -2.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.0938, -3.7812, -0.4023, 3.0938, -1.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1406, -3.8906, -2.4844, 1.7891, -0.5703]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:10:57,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.19 | optimizer_step: 0.21 [2025-11-06 19:10:57,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.87 | bwd_microstep: 683.55 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 682.45 | step_microstep: 1.94 [2025-11-06 19:10:57,632] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 337.79 | bwd: 684.25 | bwd_inner: 1.61 | bwd_allreduce: 682.50 | step: 2.02 99%|█████████▉| 3465/3507 [1:26:11<01:01, 1.46s/it] {'loss': 0.5146, 'learning_rate': 7.524897561124179e-09, 'epoch': 0.99} 99%|█████████▉| 3465/3507 [1:26:11<01:01, 1.46s/it]tensor([[-4.0938, -3.8594, 0.3789, 4.4688, -1.3047]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.1719, -0.9844, 2.9062, 4.6250, -0.6289]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:57,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.56 | bwd_microstep: 0.83 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-1.1562, 2.7188, 3.1250, -2.2344, -2.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0312, -3.3125, 2.1094, 1.5234, -4.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.7812, -3.3438, 0.8008, 2.3125, -2.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.0781, -2.5938, -0.9844, 3.0625, 0.2695]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:1') tensor([[-6.8125, -4.0938, 1.8594, 1.1641, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5312, -5.3750, -1.5625, 2.3438, -2.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:59,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.54 | optimizer_gradients: 0.14 | optimizer_step: 0.18 [2025-11-06 19:10:59,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.54 | bwd_microstep: 1.85 | bwd_inner_microstep: 0.96 | bwd_allreduce_microstep: 0.82 | step_microstep: 2.32 [2025-11-06 19:10:59,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 369.09 | bwd: 
2.68 | bwd_inner: 1.67 | bwd_allreduce: 0.86 | step: 2.41 99%|█████████▉| 3466/3507 [1:26:13<01:08, 1.66s/it] {'loss': 0.5855, 'learning_rate': 7.170877242658192e-09, 'epoch': 0.99} 99%|█████████▉| 3466/3507 [1:26:13<01:08, 1.66s/it]tensor([[-2.7344, -3.2812, -0.9219, 3.4844, -0.2363]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:10:59,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.19 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-9.3750, -8.4375, -2.9219, 0.0295, -5.8750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7812, -3.3281, 1.5234, 1.3281, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8438, -2.8906, 1.1016, 1.4922, -3.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7656, -4.4062, -2.4844, 1.7266, -1.0859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.7500, -0.6133, 1.0312, -0.0459, -2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.8281, -3.1875, 1.0859, 4.3438, -1.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.4688, -3.9531, 0.5664, 1.8906, -3.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:11:00,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.19 | optimizer_gradients: 0.14 | optimizer_step: 0.17 [2025-11-06 19:11:00,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.41 | bwd_microstep: 39.94 | bwd_inner_microstep: 1.21 | bwd_allreduce_microstep: 38.66 | step_microstep: 1.46 [2025-11-06 19:11:00,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 341.62 | bwd: 40.79 | bwd_inner: 1.98 | 
bwd_allreduce: 38.69 | step: 1.53 99%|█████████▉| 3467/3507 [1:26:13<00:51, 1.29s/it] {'loss': 0.3306, 'learning_rate': 6.82538350498918e-09, 'epoch': 0.99} 99%|█████████▉| 3467/3507 [1:26:13<00:51, 1.29s/it]tensor([[-2.2031, 2.0469, 2.9062, -2.8281, -3.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.9062, -5.9688, 0.4980, 2.1875, -5.0625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.5000, -6.3125, -0.6523, 2.0938, -4.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:00,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.32 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.5312, -0.1377, 2.7812, 1.8125, -1.9297]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8438, -4.2500, -0.2305, 2.9062, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1875, -4.3750, -1.5781, 2.0156, -1.6641]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-1.9375, -2.2188, -0.0737, 3.6094, 0.2158]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-1.0938, 2.2031, 4.3438, 0.8633, -1.6328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:01,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.18 | optimizer_step: 0.24 [2025-11-06 19:11:01,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.37 | bwd_microstep: 2.07 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 1.00 | step_microstep: 2.65 [2025-11-06 19:11:01,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 452.71 | bwd: 2.77 | bwd_inner: 1.57 | bwd_allreduce: 1.04 | step: 2.73 
99%|█████████▉| 3468/3507 [1:26:15<00:48, 1.23s/it] {'loss': 0.1595, 'learning_rate': 6.488416642914663e-09, 'epoch': 0.99} 99%|█████████▉| 3468/3507 [1:26:15<00:48, 1.23s/it]tensor([[-5.2500, -2.8750, 0.7812, 0.3027, -3.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:01,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 146.75 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.77 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.7500, -4.5000, 0.6445, 2.9688, -3.2344]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.6562, -3.0625, -0.4004, 3.8906, -0.1816]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.6250, -2.2812, 3.2031, -1.1875, -6.1562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6875, -3.3438, 1.5234, 1.6016, -3.9219]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6562, -4.7500, -0.1641, 2.6250, -3.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.5938, -2.7656, 2.7812, -0.2832, -5.6562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7188, -4.0312, 0.2812, 3.1250, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:11:04,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.18 | optimizer_step: 0.21 [2025-11-06 19:11:04,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.36 | bwd_microstep: 2413.09 | bwd_inner_microstep: 1.37 | bwd_allreduce_microstep: 2411.60 | step_microstep: 2.24 [2025-11-06 19:11:04,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 311.13 | bwd: 2413.97 | bwd_inner: 2.15 | bwd_allreduce: 2411.66 | step: 2.33 99%|█████████▉| 
3469/3507 [1:26:17<01:04, 1.69s/it] {'loss': 0.1855, 'learning_rate': 6.1599769439590896e-09, 'epoch': 0.99} 99%|█████████▉| 3469/3507 [1:26:17<01:04, 1.69s/it]tensor([[-3.1719, -4.0312, -2.1094, 2.5469, -0.4434]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.8906, -0.3633, 1.8281, -1.5703, -3.8594]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.9375, -0.5352, 2.3594, -0.6406, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:04,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.06 | bwd_microstep: 0.79 | bwd_inner_microstep: 0.70 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.2812, -4.2188, 0.3730, 2.9375, -2.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-8.1875, -4.9375, 1.5312, 0.0757, -6.3750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5625, -2.6562, 1.4766, 1.2578, -3.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.1875, -5.3438, -0.8711, 1.7500, -3.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.4062, -4.0312, 0.9727, 0.7344, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:04,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.53 | optimizer_gradients: 0.14 | optimizer_step: 0.14 [2025-11-06 19:11:04,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.81 | bwd_microstep: 1.99 | bwd_inner_microstep: 1.09 | bwd_allreduce_microstep: 0.83 | step_microstep: 3.80 [2025-11-06 19:11:04,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 312.88 | bwd: 2.78 | bwd_inner: 1.80 | bwd_allreduce: 0.86 | step: 3.87 99%|█████████▉| 3470/3507 [1:26:18<00:50, 1.37s/it] 
{'loss': 0.2628, 'learning_rate': 5.840064688370506e-09, 'epoch': 0.99} 99%|█████████▉| 3470/3507 [1:26:18<00:50, 1.37s/it]tensor([[-5.1250, -4.5938, -0.3945, 2.8906, -2.4062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.2188, 0.1250, 3.4531, -1.7188, -4.6250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:04,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 266.23 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.4688, 1.0703, 2.6562, -1.2109, -2.9219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-4.8125, -5.0000, -1.2891, 3.2969, -1.7578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.3125, -2.2031, 3.1875, -0.9062, -5.8125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9375, -2.5156, 3.0469, 0.6992, -4.9062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.1875, -0.2695, 2.7188, -1.7109, -4.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7812, -3.9375, -0.1758, 2.1250, -2.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:11:09,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.57 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:11:09,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.10 | bwd_microstep: 3913.62 | bwd_inner_microstep: 1.24 | bwd_allreduce_microstep: 3912.29 | step_microstep: 2.47 [2025-11-06 19:11:09,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 463.35 | bwd: 3914.54 | bwd_inner: 2.08 | bwd_allreduce: 3912.33 | step: 2.55 99%|█████████▉| 3471/3507 [1:26:22<01:22, 2.29s/it] {'loss': 0.2729, 
'learning_rate': 5.528680149120557e-09, 'epoch': 0.99} 99%|█████████▉| 3471/3507 [1:26:22<01:22, 2.29s/it]tensor([[-3.6719, -0.3262, 2.5781, -0.6016, -3.5781]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:09,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 111.44 | bwd_microstep: 0.97 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-2.9688, -1.2812, 1.1250, 1.0469, -1.9844]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.4688, -4.4375, -0.1650, 2.2500, -3.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.8438e+00, -4.6875e+00, 4.4556e-03, 2.2168e-01, -4.8125e+00]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.0000, -0.5586, 2.0469, -1.2578, -3.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-6.0625, -7.0312, -5.1562, -0.3262, -2.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.6562, -1.2109, 3.0625, 2.3594, -2.6094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -2.4219, 1.8750, 1.1094, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') [2025-11-06 19:11:09,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:11:09,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.61 | bwd_microstep: 133.22 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 132.12 | step_microstep: 1.75 [2025-11-06 19:11:09,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 281.07 | bwd: 134.19 | bwd_inner: 1.90 | bwd_allreduce: 132.16 | step: 1.83 99%|█████████▉| 3472/3507 [1:26:23<01:00, 1.74s/it] {'loss': 0.8123, 
'learning_rate': 5.225823591903378e-09, 'epoch': 0.99} 99%|█████████▉| 3472/3507 [1:26:23<01:00, 1.74s/it]tensor([[-2.9531, -3.2812, -0.8867, 3.1406, -0.5117]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2812, -3.6406, 0.5625, 3.8438, -1.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.3438, -5.4375, -0.3574, 2.5938, -3.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:09,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.83 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-4.0000, 0.1816, 3.2812, -1.5859, -4.3750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.1875, -1.6094, 2.2188, -0.9570, -4.7500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.1875, -4.7500, -0.6445, 2.5156, -2.5469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -4.2500, -0.1396, 2.7188, -2.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2812, 1.2812, 2.8438, -1.1328, -2.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:11:11,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.23 | optimizer_gradients: 0.19 | optimizer_step: 0.56 [2025-11-06 19:11:11,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 139.01 | bwd_microstep: 1780.62 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1779.42 | step_microstep: 2.97 [2025-11-06 19:11:11,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 354.85 | bwd: 1781.52 | bwd_inner: 1.90 | bwd_allreduce: 1779.47 | step: 3.06 99%|█████████▉| 3473/3507 [1:26:25<01:03, 1.87s/it] {'loss': 0.0727, 'learning_rate': 
4.93149527513781e-09, 'epoch': 0.99} 99%|█████████▉| 3473/3507 [1:26:25<01:03, 1.87s/it]tensor([[-2.2344, 1.4766, 2.2031, -2.1094, -2.9219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:11,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 70.39 | bwd_microstep: 1.04 | bwd_inner_microstep: 0.94 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-5.0938, -4.3125, -0.1777, 2.5000, -2.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.8125, -3.4062, 1.9062, 1.6797, -4.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2031, -1.0312, 1.0156, 2.0000, -1.0391]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.3750, 2.0625, 4.4688, -1.7344, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.8750, -4.4062, -0.1484, 3.4375, -2.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.1719, -3.9375, -2.3906, 1.9922, -0.4785]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-7.3438, -4.3125, 2.0156, 1.0469, -5.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:11:12,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.25 | optimizer_step: 0.20 [2025-11-06 19:11:12,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.70 | bwd_microstep: 227.72 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 226.51 | step_microstep: 2.12 [2025-11-06 19:11:12,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 296.11 | bwd: 228.76 | bwd_inner: 2.06 | bwd_allreduce: 226.56 | step: 2.19 99%|█████████▉| 3474/3507 [1:26:26<00:48, 1.47s/it] {'loss': 0.2807, 'learning_rate': 4.645695449965182e-09, 'epoch': 
0.99} 99%|█████████▉| 3474/3507 [1:26:26<00:48, 1.47s/it]tensor([[-5.2812, -5.4375, -2.4844, 1.5156, -2.2969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9219, -0.6680, 1.7734, 0.6406, -2.2656]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.2500, -3.5312, 1.2344, 2.5469, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.7188, -2.3750, 2.7812, 0.8008, -4.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:12,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.96 | bwd_microstep: 0.96 | bwd_inner_microstep: 0.85 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.2188, -4.2500, 0.4531, 3.2031, -2.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3125, -2.9375, 0.5703, 1.8828, -2.5156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8750, -5.7500, -1.7500, 2.2969, -2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.2344, 0.5273, 3.1406, -1.3828, -3.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:11:13,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.47 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:11:13,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.33 | bwd_microstep: 482.70 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 481.86 | step_microstep: 2.04 [2025-11-06 19:11:13,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 391.32 | bwd: 483.66 | bwd_inner: 1.61 | bwd_allreduce: 481.91 | step: 2.12 99%|█████████▉| 3475/3507 [1:26:27<00:41, 1.31s/it] {'loss': 0.4565, 'learning_rate': 4.368424360251533e-09, 'epoch': 0.99} 99%|█████████▉| 
3475/3507 [1:26:27<00:41, 1.31s/it]tensor([[-4.2188, -2.3906, 1.6094, 2.0469, -2.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:13,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 133.24 | bwd_microstep: 2.82 | bwd_inner_microstep: 2.67 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-6.2500, -2.3594, 1.7734, -1.9062, -5.7188]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1250, -2.3594, 1.5781, 2.4375, -2.4844]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5938, -4.0625, -0.0129, 3.1250, -2.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.9688, -3.0000, 1.4453, 1.9062, -3.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5938, -3.6250, 0.2432, 2.4844, -2.3906]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.6562, -4.7812, -1.4062, 2.6094, -1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.6562, -0.2734, 3.5156, -1.6875, -4.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:11:13,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.17 | optimizer_step: 0.17 [2025-11-06 19:11:13,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.92 | bwd_microstep: 34.04 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 33.17 | step_microstep: 1.97 [2025-11-06 19:11:13,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 324.19 | bwd: 36.86 | bwd_inner: 3.48 | bwd_allreduce: 33.22 | step: 2.07 99%|█████████▉| 3476/3507 [1:26:27<00:32, 1.03s/it] {'loss': 0.2044, 'learning_rate': 4.099682242580949e-09, 'epoch': 0.99} 99%|█████████▉| 3476/3507 [1:26:27<00:32, 
1.03s/it]tensor([[-4.4062, -1.4141, 2.8438, 1.0391, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.6875, -5.9688, -1.8672, -0.9219, -5.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.2812, -4.1250, 2.1562, 0.8320, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:13,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.31 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.66 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-3.4062, -4.1250, -1.8672, 2.7031, -0.6172]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-6.2500, -2.4062, 2.8750, -0.5117, -5.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8438, -3.4844, -1.8359, 2.3125, -0.4238]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:0') tensor([[-5.1250, -5.0000, -0.5117, 3.6250, -2.1250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.0938, -4.7188, -2.2500, 2.2656, -1.2266]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:11:16,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 19:11:16,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.31 | bwd_microstep: 1447.64 | bwd_inner_microstep: 0.87 | bwd_allreduce_microstep: 1446.67 | step_microstep: 2.11 [2025-11-06 19:11:16,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 405.63 | bwd: 1448.38 | bwd_inner: 1.54 | bwd_allreduce: 1446.71 | step: 2.19 99%|█████████▉| 3477/3507 [1:26:29<00:44, 1.47s/it] {'loss': 0.5837, 'learning_rate': 3.839469326265555e-09, 'epoch': 0.99} 99%|█████████▉| 3477/3507 [1:26:29<00:44, 
1.47s/it]tensor([[-1.1094, -2.0938, -2.3438, 1.0312, 0.7930]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-3.1094, -3.4219, -1.0859, 2.5781, -0.7539]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.8438, -4.2812, 0.1006, 1.4609, -3.6719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.9531, 1.0938, 3.7500, -1.1172, -3.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.4375, -3.0938, 2.5000, 0.3125, -5.2812]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.7812, -5.1250, -0.3809, 3.1094, -2.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.4062, -3.9688, 0.6172, 0.0952, -4.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:17,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.80 | bwd_microstep: 0.88 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0000, -3.5000, 1.0625, 2.6719, -2.9062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:17,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.50 | optimizer_gradients: 0.18 | optimizer_step: 0.16 [2025-11-06 19:11:17,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.45 | bwd_microstep: 1.79 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.73 | step_microstep: 2.16 [2025-11-06 19:11:17,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 427.28 | bwd: 2.68 | bwd_inner: 1.79 | bwd_allreduce: 0.77 | step: 2.25 99%|█████████▉| 3478/3507 [1:26:31<00:40, 1.38s/it] {'loss': 0.3283, 'learning_rate': 3.5877858333366323e-09, 'epoch': 0.99} 99%|█████████▉| 3478/3507 [1:26:31<00:40, 1.38s/it]tensor([[-3.5938, -1.3281, 
2.6406, 1.9453, -2.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.0625, -7.1250, -1.8594, 1.2969, -4.8125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0312, -5.8125, -2.3438, 3.1562, -1.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:17,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.76 | bwd_microstep: 1.02 | bwd_inner_microstep: 0.91 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.09 tensor([[-3.4844, -1.2891, 1.2734, -0.0767, -2.7344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.5156, -4.1562, -2.3594, 1.6406, -0.9023]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9062, -2.2969, 1.4766, -0.0240, -3.9219]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.8125, -4.2812, -0.0713, 3.2656, -2.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.3438, -3.0469, -1.8750, 1.7969, -0.1289]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') [2025-11-06 19:11:19,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.32 | optimizer_gradients: 0.19 | optimizer_step: 0.20 [2025-11-06 19:11:19,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.20 | bwd_microstep: 1387.55 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1386.45 | step_microstep: 2.11 [2025-11-06 19:11:19,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 406.97 | bwd: 1388.57 | bwd_inner: 1.92 | bwd_allreduce: 1386.50 | step: 2.20 99%|█████████▉| 3479/3507 [1:26:32<00:42, 1.54s/it] {'loss': 0.408, 'learning_rate': 3.3446319785468418e-09, 'epoch': 0.99} 99%|█████████▉| 3479/3507 [1:26:32<00:42, 1.54s/it]tensor([[-2.1875, 1.9766, 3.5469, -2.0000, 
-3.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.6719, -4.0312, -0.8008, 3.8906, -0.8672]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5000, -3.4844, -0.0444, 2.0000, -2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:19,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.52 | bwd_microstep: 0.78 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.5938, -3.7031, 1.0234, 1.6875, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -0.6094, 3.8906, -1.6562, -5.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-2.5781, 1.5547, 3.4531, -1.5234, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-3.1094, -1.1016, 0.5703, -0.4707, -2.4844]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.2656, -4.0938, -3.1094, 0.7422, -0.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:19,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.16 | optimizer_step: 0.19 [2025-11-06 19:11:19,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.55 | bwd_microstep: 1.98 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 0.86 | step_microstep: 2.19 [2025-11-06 19:11:19,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 358.10 | bwd: 2.75 | bwd_inner: 1.71 | bwd_allreduce: 0.89 | step: 2.28 99%|█████████▉| 3480/3507 [1:26:33<00:35, 1.32s/it] {'loss': 0.6993, 'learning_rate': 3.1100079693735517e-09, 'epoch': 0.99} 99%|█████████▉| 3480/3507 [1:26:33<00:35, 1.32s/it]tensor([[-4.3750, -1.8047, 2.4844, 1.6797, -3.2344]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0938, -4.4062, 0.7109, 2.3125, -3.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:20,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.76 | bwd_microstep: 0.82 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-7.5938, -6.1250, -0.2812, 1.9453, -4.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2812, -2.3750, 1.9453, 0.7578, -4.0938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-1.2031, 1.4453, 1.7109, -1.3359, -1.7422]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:3') tensor([[-4.3750, -3.4219, 0.3320, 2.4844, -2.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.4375, -5.9375, -1.6797, 1.4922, -3.5625]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5312, -3.7188, 0.5039, 0.9336, -3.7656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:22,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.49 | optimizer_gradients: 0.15 | optimizer_step: 0.21 [2025-11-06 19:11:22,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 136.85 | bwd_microstep: 2.10 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.93 | step_microstep: 2.37 [2025-11-06 19:11:22,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 338.63 | bwd: 2.92 | bwd_inner: 1.82 | bwd_allreduce: 0.97 | step: 2.44 99%|█████████▉| 3481/3507 [1:26:36<00:42, 1.65s/it] {'loss': 0.5955, 'learning_rate': 2.883914006014399e-09, 'epoch': 0.99} 99%|█████████▉| 3481/3507 [1:26:36<00:42, 1.65s/it]tensor([[-4.3125, -0.4395, 3.3125, -0.8086, -4.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:2') tensor([[-6.3125, -5.3438, -0.4766, 2.0781, -3.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.7656, 0.0275, 1.5859, -2.5938, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.7812, 0.0894, 2.7656, -1.2500, -3.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.9062, -1.7578, 3.7344, -0.2119, -5.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.9844, -1.9531, 1.0625, 0.5469, -2.8906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[1.5859, 4.3438, 4.4688, 0.9688, 0.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:0') [2025-11-06 19:11:23,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 243.86 | bwd_microstep: 0.67 | bwd_inner_microstep: 0.57 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.3750, -2.6719, 0.5703, 0.6289, -3.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:23,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.75 | optimizer_gradients: 0.24 | optimizer_step: 0.26 [2025-11-06 19:11:23,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 218.27 | bwd_microstep: 2.88 | bwd_inner_microstep: 1.49 | bwd_allreduce_microstep: 1.28 | step_microstep: 7.78 [2025-11-06 19:11:23,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 462.17 | bwd: 3.56 | bwd_inner: 2.08 | bwd_allreduce: 1.32 | step: 7.87 99%|█████████▉| 3482/3507 [1:26:37<00:38, 1.54s/it] {'loss': 0.3154, 'learning_rate': 2.6663502813872865e-09, 'epoch': 0.99} 99%|█████████▉| 3482/3507 [1:26:37<00:38, 1.54s/it]tensor([[-6.0312, -4.2500, 0.3242, 0.9102, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 
19:11:23,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 125.86 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.9844, -1.1719, 1.1094, 1.3203, -1.8828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.5312, -0.4590, 2.9688, -1.7891, -4.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9375, -0.9688, 1.6406, -2.8438, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:1') tensor([[-6.3750, -2.0156, 2.5312, -2.2656, -6.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.9062, -4.5000, 1.0391, 1.3359, -4.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2812, -2.9219, 2.5312, 0.5039, -5.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.9375, -5.5000, -0.3594, 1.5469, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:24,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.51 | optimizer_gradients: 0.17 | optimizer_step: 0.16 [2025-11-06 19:11:24,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.42 | bwd_microstep: 1.81 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.39 [2025-11-06 19:11:24,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 279.29 | bwd: 2.67 | bwd_inner: 1.72 | bwd_allreduce: 0.82 | step: 2.47 99%|█████████▉| 3483/3507 [1:26:38<00:35, 1.47s/it] {'loss': 0.624, 'learning_rate': 2.4573169811337173e-09, 'epoch': 0.99} 99%|█████████▉| 3483/3507 [1:26:38<00:35, 1.47s/it]tensor([[-4.2500, -4.4688, -1.3984, 2.8125, -1.4609]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') tensor([[-4.4375, -4.2188, -0.9375, 2.3125, 
-1.9609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:25,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 153.03 | bwd_microstep: 0.92 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-6.3438, -5.3125, -0.4980, 2.2500, -3.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.5312, -1.7891, 3.1406, -0.1611, -4.9375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.0000, -3.7656, 0.9805, 1.0547, -4.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4062, -3.9531, -0.1807, 1.1172, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0938, -4.0000, 1.1016, 1.4922, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-3.8281, -4.0312, -0.9141, 2.9375, -1.2891]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') [2025-11-06 19:11:25,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 19:11:25,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 156.00 | bwd_microstep: 206.01 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 205.20 | step_microstep: 1.51 [2025-11-06 19:11:25,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 309.04 | bwd: 206.93 | bwd_inner: 1.53 | bwd_allreduce: 205.25 | step: 1.61 99%|█████████▉| 3484/3507 [1:26:39<00:27, 1.21s/it] {'loss': 1.3292, 'learning_rate': 2.2568142836154607e-09, 'epoch': 0.99} 99%|█████████▉| 3484/3507 [1:26:39<00:27, 1.21s/it]tensor([[-5.8125e+00, -3.9375e+00, 5.1575e-03, 4.5898e-01, -3.9531e+00]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:25,737] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.17 | bwd_microstep: 0.65 | bwd_inner_microstep: 0.55 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07 tensor([[-2.7656, -3.5938, -1.6016, 3.0625, -0.1089]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7188, -3.9375, 0.3789, 1.2891, -3.7344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.0312, -3.0625, 0.1016, 4.0625, -0.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.1875, -3.6562, 0.0215, 2.9375, -1.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5312, -1.3359, 1.9609, -0.8828, -4.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.2188, -3.4062, -0.0103, 4.4062, -0.5859]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.2812, -4.5938, 0.3086, 2.0000, -3.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:28,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.53 | optimizer_gradients: 0.13 | optimizer_step: 0.15 [2025-11-06 19:11:28,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 118.59 | bwd_microstep: 2.00 | bwd_inner_microstep: 1.14 | bwd_allreduce_microstep: 0.79 | step_microstep: 2.11 [2025-11-06 19:11:28,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 273.77 | bwd: 2.65 | bwd_inner: 1.71 | bwd_allreduce: 0.82 | step: 2.19 99%|█████████▉| 3485/3507 [1:26:41<00:35, 1.62s/it] {'loss': 0.1611, 'learning_rate': 2.0648423599156642e-09, 'epoch': 0.99} 99%|█████████▉| 3485/3507 [1:26:41<00:35, 1.62s/it]tensor([[-7.2812, -5.5312, -1.2344, -0.3887, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.9688, -6.0000, -1.7891, 0.6992, -4.2500]], 
device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.6875, -3.3125, 1.9531, 2.1250, -3.8906]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.9062, -3.3750, 1.5859, 0.7461, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:28,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.83 | bwd_microstep: 0.75 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-3.1406, 1.6484, 4.0000, -2.6250, -4.3438]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.9375, -6.1875, -1.3281, -0.3730, -5.4688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.5000, 1.3438, 2.0469, -2.9219, -3.3750]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.5938, -1.0391, 4.0000, -1.0781, -5.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:28,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.27 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 19:11:28,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.31 | bwd_microstep: 1.95 | bwd_inner_microstep: 1.07 | bwd_allreduce_microstep: 0.81 | step_microstep: 1.40 [2025-11-06 19:11:28,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 461.17 | bwd: 2.70 | bwd_inner: 1.75 | bwd_allreduce: 0.84 | step: 1.47 99%|█████████▉| 3486/3507 [1:26:42<00:26, 1.28s/it] {'loss': 0.4124, 'learning_rate': 1.8814013738377436e-09, 'epoch': 0.99} 99%|█████████▉| 3486/3507 [1:26:42<00:26, 1.28s/it]tensor([[-3.7344, 0.5781, 3.3438, -1.8125, -4.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.2500, -4.1562, 0.2002, 0.4297, -4.4062]], device='cuda:3', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.9062, -1.4609, 2.2188, -0.7617, -4.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:28,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 149.35 | bwd_microstep: 0.94 | bwd_inner_microstep: 0.82 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.08 tensor([[-3.4375, 0.0299, 2.5781, -1.3984, -3.6406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-8.1250, -7.6562, -2.5781, 1.5234, -4.5938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2812, -4.8750, -0.5430, 3.0000, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8125, -1.0391, 1.8281, -1.7656, -4.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -3.7031, 0.9141, 2.5156, -3.0312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:30,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.44 | optimizer_gradients: 0.14 | optimizer_step: 0.16 [2025-11-06 19:11:30,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.42 | bwd_microstep: 1.91 | bwd_inner_microstep: 0.97 | bwd_allreduce_microstep: 0.87 | step_microstep: 2.20 [2025-11-06 19:11:30,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 334.78 | bwd: 2.85 | bwd_inner: 1.82 | bwd_allreduce: 0.90 | step: 2.28 99%|█████████▉| 3487/3507 [1:26:44<00:30, 1.50s/it] {'loss': 0.1735, 'learning_rate': 1.7064914819064914e-09, 'epoch': 0.99} 99%|█████████▉| 3487/3507 [1:26:44<00:30, 1.50s/it]tensor([[-0.7227, 2.9219, 4.0938, -0.1064, -1.6406]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.4688, -2.3125, 1.7969, 3.7812, -1.5156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) 
tensor([3], device='cuda:2') tensor([[-3.5312, -0.8984, 2.7500, 1.1094, -2.8906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:30,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.57 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.75 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.5312, -4.9375, 0.1245, 1.4922, -4.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6875, -3.8125, 0.0742, 2.3125, -2.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-4.5312, -0.1914, 3.6875, -1.3516, -4.8125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.1719, 0.8945, 3.3438, -1.3828, -3.7031]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.7188, -6.1562, -1.8203, 1.3438, -3.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:11:31,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.12 | optimizer_gradients: 0.14 | optimizer_step: 0.15 [2025-11-06 19:11:31,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 113.05 | bwd_microstep: 89.07 | bwd_inner_microstep: 1.13 | bwd_allreduce_microstep: 87.85 | step_microstep: 1.40 [2025-11-06 19:11:31,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 265.65 | bwd: 89.92 | bwd_inner: 1.90 | bwd_allreduce: 87.88 | step: 1.48 99%|█████████▉| 3488/3507 [1:26:44<00:22, 1.17s/it] {'loss': 0.5188, 'learning_rate': 1.540112833366969e-09, 'epoch': 0.99} 99%|█████████▉| 3488/3507 [1:26:44<00:22, 1.17s/it]tensor([[-4.8438, -4.0000, -0.5273, 1.3594, -2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-7.3750, -4.8750, 1.0547, 0.9453, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') 
tensor([[-4.2500, -4.1250, -0.5078, 3.1250, -1.6562]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.1562, -3.6562, 0.4609, 3.7812, -1.6016]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:31,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.44 | bwd_microstep: 0.95 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-3.3906, 0.9375, 3.3906, -2.1719, -4.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-8.7500, -7.6562, -1.8203, 1.4688, -5.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.8438, -2.0781, 2.0156, 0.5078, -3.8594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.6250, -5.5312, 0.6836, 1.8047, -5.0625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:33,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.53 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 19:11:33,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 290.01 | bwd_microstep: 2.02 | bwd_inner_microstep: 1.08 | bwd_allreduce_microstep: 0.87 | step_microstep: 4.00 [2025-11-06 19:11:33,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 493.47 | bwd: 2.96 | bwd_inner: 1.93 | bwd_allreduce: 0.90 | step: 4.08 99%|█████████▉| 3489/3507 [1:26:47<00:29, 1.66s/it] {'loss': 0.4373, 'learning_rate': 1.3822655701856147e-09, 'epoch': 0.99} 99%|█████████▉| 3489/3507 [1:26:47<00:29, 1.66s/it]tensor([[-4.3438, -1.7031, 1.9766, 0.2734, -3.5781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5938, -2.8281, 0.5977, 0.9023, -3.1250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.7188, -3.6562, 0.9766, 
1.4766, -3.8281]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-3.7656, -3.0156, 0.6758, 3.1719, -1.6797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:34,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.27 | bwd_microstep: 0.76 | bwd_inner_microstep: 0.67 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-6.5938, -3.1875, 2.2344, 0.3301, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.8750, -2.3438, 2.5625, -0.5078, -5.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.5312, -1.9922, 2.1562, 1.3984, -3.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2188, -3.0938, 1.4453, 1.9609, -3.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:34,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.33 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 19:11:34,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.14 | bwd_microstep: 1.81 | bwd_inner_microstep: 1.02 | bwd_allreduce_microstep: 0.72 | step_microstep: 1.69 [2025-11-06 19:11:34,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 398.43 | bwd: 2.57 | bwd_inner: 1.71 | bwd_allreduce: 0.75 | step: 1.77 100%|█████████▉| 3490/3507 [1:26:48<00:21, 1.29s/it] {'loss': 1.0257, 'learning_rate': 1.2329498270480245e-09, 'epoch': 1.0} 100%|█████████▉| 3490/3507 [1:26:48<00:21, 1.29s/it]tensor([[-3.9844, -1.2656, 1.9766, 0.5742, -3.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.3438, -3.3281, 0.9648, 1.4219, -3.6094]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.9688, -2.0469, 1.2109, -0.9258, -4.2500]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:34,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.99 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.62 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-8.4375, -5.5312, 0.8164, 0.1157, -6.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7500, -4.4062, 0.1328, 2.2656, -3.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.3750, -1.3906, 2.8281, -1.1406, -5.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-3.7656, -0.5273, 2.5938, 0.1104, -3.3594]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.0625, -3.2188, 2.1719, 1.4141, -4.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:35,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.43 | optimizer_gradients: 0.16 | optimizer_step: 0.20 [2025-11-06 19:11:35,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 142.97 | bwd_microstep: 1.90 | bwd_inner_microstep: 0.88 | bwd_allreduce_microstep: 0.93 | step_microstep: 2.16 [2025-11-06 19:11:35,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 344.98 | bwd: 2.63 | bwd_inner: 1.51 | bwd_allreduce: 0.97 | step: 2.26 100%|█████████▉| 3491/3507 [1:26:49<00:21, 1.35s/it] {'loss': 0.3341, 'learning_rate': 1.0921657313622825e-09, 'epoch': 1.0} 100%|█████████▉| 3491/3507 [1:26:49<00:21, 1.35s/it]tensor([[-2.7031, -2.9688, -0.1436, 3.9219, -0.2773]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2188e+00, -4.3125e+00, 1.9836e-04, 2.3906e+00, -2.8281e+00]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.2812, -4.2188, -1.1484, 2.3125, -1.7656]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-8.6875, -6.5625, -0.3086, 0.7148, -5.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:36,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.87 | bwd_microstep: 0.86 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-8.3750, -4.6875, 1.4062, -1.1875, -6.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.4375, -3.1562, 0.5586, -1.8438, -5.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2812, -4.6875, -0.7578, 2.1719, -2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.3438, -4.9062, 0.6445, 2.6562, -3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:11:36,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.17 | optimizer_gradients: 0.23 | optimizer_step: 0.21 [2025-11-06 19:11:36,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.89 | bwd_microstep: 25.22 | bwd_inner_microstep: 1.10 | bwd_allreduce_microstep: 24.01 | step_microstep: 2.07 [2025-11-06 19:11:36,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.79 | bwd: 26.08 | bwd_inner: 1.86 | bwd_allreduce: 24.06 | step: 2.16 100%|█████████▉| 3492/3507 [1:26:50<00:16, 1.08s/it] {'loss': 0.1077, 'learning_rate': 9.599134032534096e-10, 'epoch': 1.0} 100%|█████████▉| 3492/3507 [1:26:50<00:16, 1.08s/it]tensor([[-5.7188, -1.8984, 2.2500, -1.3047, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:36,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 135.68 | bwd_microstep: 5.26 | bwd_inner_microstep: 5.14 | bwd_allreduce_microstep: 0.04 | step_microstep: 0.09 tensor([[-5.2812, -4.0000, 
0.6875, 2.7969, -2.9531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-7.3438, -5.3125, 0.4336, 1.5938, -4.8750]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5312, -2.6719, 1.8281, -1.8125, -5.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.5312, -2.4688, 1.4531, 1.5000, -3.0938]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0938, -1.4844, 2.9531, -0.2656, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.9688, -1.1094, 3.2969, -0.4512, -4.6875]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.7500, -1.7812, 2.4688, 0.3906, -3.9688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:39,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.17 | optimizer_step: 0.19 [2025-11-06 19:11:39,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 213.79 | bwd_microstep: 1.77 | bwd_inner_microstep: 0.84 | bwd_allreduce_microstep: 0.83 | step_microstep: 2.45 [2025-11-06 19:11:39,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 349.48 | bwd: 7.02 | bwd_inner: 5.99 | bwd_allreduce: 0.88 | step: 2.54 100%|█████████▉| 3493/3507 [1:26:53<00:23, 1.70s/it] {'loss': 0.324, 'learning_rate': 8.361929555700255e-10, 'epoch': 1.0} 100%|█████████▉| 3493/3507 [1:26:53<00:23, 1.70s/it]tensor([[-7.0625, -4.0625, 1.8281, 0.5234, -5.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-1.7422, 2.6250, 5.2500, -0.3125, -2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -5.3750, -3.6719, 1.5859, -1.0000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:2') 
tensor([[-4.9062, -4.5938, -0.5898, 2.6562, -2.2812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-3.2656, -4.0000, -1.8281, 2.6875, -0.6055]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.7812, -4.1250, -0.4004, 0.4316, -3.8438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.7812, -1.8672, 3.8125, 0.3750, -5.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:39,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.44 | bwd_microstep: 0.70 | bwd_inner_microstep: 0.59 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-4.2812, -3.9844, 0.0205, 3.7344, -1.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:40,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.38 | optimizer_gradients: 0.15 | optimizer_step: 0.16 [2025-11-06 19:11:40,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.49 | bwd_microstep: 1.66 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 0.78 | step_microstep: 1.90 [2025-11-06 19:11:40,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 462.97 | bwd: 2.35 | bwd_inner: 1.39 | bwd_allreduce: 0.82 | step: 1.98 100%|█████████▉| 3494/3507 [1:26:53<00:18, 1.42s/it] {'loss': 0.4412, 'learning_rate': 7.210044938776862e-10, 'epoch': 1.0} 100%|█████████▉| 3494/3507 [1:26:53<00:18, 1.42s/it]tensor([[-2.0938, 1.4453, 2.6875, -1.5312, -2.7500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([1], device='cuda:2') tensor([[-4.7500, -5.2500, -2.6562, 1.7656, -1.7031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-2.9531, 1.5391, 3.8281, -2.1094, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:11:40,342] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.13 | bwd_microstep: 0.77 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.2188, -3.3125, 0.8750, 1.5078, -3.4375]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-4.4062, -3.9531, -0.1572, 3.1406, -1.9062]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9375, -4.7812, -0.6953, 3.1719, -2.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.3125, -4.6562, -2.7656, 0.8594, -1.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.6875, -2.2969, 3.0156, -1.4844, -6.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:11:42,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.48 | optimizer_gradients: 0.25 | optimizer_step: 0.23 [2025-11-06 19:11:42,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 107.81 | bwd_microstep: 1214.96 | bwd_inner_microstep: 0.99 | bwd_allreduce_microstep: 1213.87 | step_microstep: 2.62 [2025-11-06 19:11:42,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 271.96 | bwd: 1215.74 | bwd_inner: 1.66 | bwd_allreduce: 1213.92 | step: 2.70 100%|█████████▉| 3495/3507 [1:26:56<00:20, 1.73s/it] {'loss': 0.3659, 'learning_rate': 6.14348116464436e-10, 'epoch': 1.0} 100%|█████████▉| 3495/3507 [1:26:56<00:20, 1.73s/it]tensor([[-5.3438, -4.1250, 0.2637, 2.1719, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.2188, -3.4844, 1.4609, 0.7734, -4.6250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.0312, -5.0312, -1.2578, 2.7344, -2.1094]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0938, -5.6875, -1.3281, 2.2188, -3.0938]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-3.1719, 0.1406, 2.5625, -0.6367, -3.1875]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.0000, -3.7344, 0.8984, 0.7734, -4.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.3125, -4.5938, 0.1514, 1.3125, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:43,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.78 | bwd_microstep: 0.99 | bwd_inner_microstep: 0.89 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.1875, -3.7656, 0.3711, 1.7344, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:43,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.37 | optimizer_gradients: 0.13 | optimizer_step: 0.16 [2025-11-06 19:11:43,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.44 | bwd_microstep: 2.13 | bwd_inner_microstep: 1.25 | bwd_allreduce_microstep: 0.81 | step_microstep: 1.85 [2025-11-06 19:11:43,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 376.24 | bwd: 3.11 | bwd_inner: 2.15 | bwd_allreduce: 0.84 | step: 1.93 100%|█████████▉| 3496/3507 [1:26:57<00:16, 1.52s/it] {'loss': 0.6461, 'learning_rate': 5.162239143352565e-10, 'epoch': 1.0} 100%|█████████▉| 3496/3507 [1:26:57<00:16, 1.52s/it]tensor([[-1.6016, -1.3359, 1.6172, 4.7812, 0.2930]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:43,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.84 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.02 | step_microstep: 0.07 tensor([[-5.9688, -2.1250, 3.3281, -0.0262, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.2812, 
-5.3750, -0.3105, 0.4746, -4.9688]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.6875, -1.9844, 1.9922, 0.6797, -3.6875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.5312, -6.0000, -1.0703, 2.5938, -3.3594]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-5.0938, -1.6250, 1.8281, -1.2969, -4.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.4688, -4.1250, 1.7578, 2.1719, -4.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-6.6562, -2.9219, 2.2969, -0.6406, -5.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:11:45,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.28 | optimizer_gradients: 0.18 | optimizer_step: 0.17 [2025-11-06 19:11:45,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 131.17 | bwd_microstep: 741.79 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 740.75 | step_microstep: 1.83 [2025-11-06 19:11:45,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 309.03 | bwd: 742.64 | bwd_inner: 1.72 | bwd_allreduce: 740.79 | step: 1.90 100%|█████████▉| 3497/3507 [1:26:59<00:17, 1.76s/it] {'loss': 0.2729, 'learning_rate': 4.266319712187272e-10, 'epoch': 1.0} 100%|█████████▉| 3497/3507 [1:26:59<00:17, 1.76s/it]tensor([[-5.1875, -5.6875, -2.8594, 1.4375, -2.1562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([4], device='cuda:3') tensor([[-4.9375, -1.5234, 2.0938, -0.7852, -4.4375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-2.8281, 0.8242, 2.0938, -2.4219, -3.4375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-4.5625, -3.5938, 0.2344, 2.3594, -2.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], 
device='cuda:3') tensor([[-6.5625, -6.5625, -2.1250, 2.5469, -3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5000, -1.7422, 2.7344, -0.8906, -5.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-0.8047, 1.5312, 4.0000, 2.8281, -0.5859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:47,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.66 | bwd_microstep: 0.85 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-9.7500, -7.0000, -0.8281, -1.0625, -7.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:47,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.29 | optimizer_gradients: 0.25 | optimizer_step: 0.26 [2025-11-06 19:11:47,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.56 | bwd_microstep: 2.40 | bwd_inner_microstep: 1.11 | bwd_allreduce_microstep: 1.11 | step_microstep: 2.62 [2025-11-06 19:11:47,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 517.22 | bwd: 3.27 | bwd_inner: 1.87 | bwd_allreduce: 1.17 | step: 2.69 100%|█████████▉| 3498/3507 [1:27:01<00:15, 1.75s/it] {'loss': 0.6661, 'learning_rate': 3.455723635592545e-10, 'epoch': 1.0} 100%|█████████▉| 3498/3507 [1:27:01<00:15, 1.75s/it]tensor([[-0.0603, 2.1094, 4.0938, 3.0781, 0.0249]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.0938, -3.2656, 1.0156, 1.3750, -3.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-2.2812, 1.8281, 3.0625, -2.6250, -3.4062]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.5938, -2.3750, 2.8750, 0.9688, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 
19:11:47,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.71 | bwd_microstep: 0.89 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-2.7188, 0.8008, 3.9375, 0.7734, -2.7656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.5625, -1.8984, 2.7656, -0.4609, -4.9375]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.2188, -3.2656, 0.6211, 0.7070, -3.6562]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-4.2188, -2.6094, 1.0547, 2.0938, -2.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:48,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.41 | optimizer_gradients: 0.20 | optimizer_step: 0.21 [2025-11-06 19:11:48,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.11 | bwd_microstep: 2.05 | bwd_inner_microstep: 1.00 | bwd_allreduce_microstep: 0.93 | step_microstep: 2.51 [2025-11-06 19:11:48,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 442.86 | bwd: 2.93 | bwd_inner: 1.76 | bwd_allreduce: 0.98 | step: 2.61 100%|█████████▉| 3499/3507 [1:27:01<00:10, 1.37s/it] {'loss': 0.3117, 'learning_rate': 2.7304516052373277e-10, 'epoch': 1.0} 100%|█████████▉| 3499/3507 [1:27:01<00:10, 1.37s/it]tensor([[-7.5000, -6.8125, -2.8125, 0.1699, -4.5000]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.5000, -3.7969, 0.2988, 1.1328, -3.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:48,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 143.27 | bwd_microstep: 0.90 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-5.7812, -2.1094, 2.9688, -0.1602, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) 
tensor([2], device='cuda:3') tensor([[-4.0000, 0.4941, 3.5156, -2.1250, -4.5938]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.7500, -3.7656, 2.0000, 0.6680, -5.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-7.1250, -5.4062, 0.6367, 2.7812, -4.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-5.2812, -5.7188, -2.6719, 1.8281, -2.1719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.5625, 0.1318, 4.3438, -1.5078, -5.0625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') [2025-11-06 19:11:48,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.21 | optimizer_gradients: 0.17 | optimizer_step: 0.18 [2025-11-06 19:11:48,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 141.68 | bwd_microstep: 128.49 | bwd_inner_microstep: 0.78 | bwd_allreduce_microstep: 127.62 | step_microstep: 1.78 [2025-11-06 19:11:48,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 284.96 | bwd: 129.38 | bwd_inner: 1.55 | bwd_allreduce: 127.67 | step: 1.87 100%|█████████▉| 3500/3507 [1:27:02<00:07, 1.10s/it] {'loss': 0.1163, 'learning_rate': 2.090504239959934e-10, 'epoch': 1.0} 100%|█████████▉| 3500/3507 [1:27:02<00:07, 1.10s/it]tensor([[-5.8125, -5.2812, -1.4766, 1.3438, -3.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.5938, -2.9219, 1.2578, 2.0469, -2.9062]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:11:48,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.70 | bwd_microstep: 0.66 | bwd_inner_microstep: 0.56 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.7188, -4.7500, -0.3926, 2.1406, -3.1719]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') 
tensor([[-5.6875, -2.6875, 2.4531, 0.9375, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.6250, -1.1172, 4.0938, -0.8125, -5.5312]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.2188, -1.1875, 3.2188, -0.8594, -5.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.0625, -0.5469, 1.2266, -2.4219, -4.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-4.8438, -0.8086, 3.0469, -1.0547, -4.7188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') [2025-11-06 19:11:51,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.42 | optimizer_gradients: 0.24 | optimizer_step: 0.32 [2025-11-06 19:11:51,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.64 | bwd_microstep: 1666.82 | bwd_inner_microstep: 4.84 | bwd_allreduce_microstep: 1661.84 | step_microstep: 4.53 [2025-11-06 19:11:51,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 359.36 | bwd: 1667.49 | bwd_inner: 5.42 | bwd_allreduce: 1661.90 | step: 4.61 100%|█████████▉| 3501/3507 [1:27:05<00:10, 1.74s/it] {'loss': 0.7944, 'learning_rate': 1.535882085823559e-10, 'epoch': 1.0} 100%|█████████▉| 3501/3507 [1:27:05<00:10, 1.74s/it]tensor([[-1.8906, -0.9453, 1.8359, 3.5312, -0.4902]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.4375, -4.6562, -0.0967, 1.1641, -4.1562]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-6.5000, -4.3438, 0.6211, 0.8398, -4.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:52,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.08 | bwd_microstep: 5.37 | bwd_inner_microstep: 5.20 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.11 tensor([[-5.6250, 
-2.5469, 2.5938, 0.8242, -4.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-7.8438, -6.4375, -1.3906, 0.3125, -5.1250]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-2.6406, 0.5391, 2.5469, -0.1689, -2.6250]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.1562, -3.4062, 1.4766, 2.3438, -3.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-5.4375, -3.0938, 1.1172, 0.5586, -4.0625]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:11:52,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.18 | optimizer_gradients: 0.23 | optimizer_step: 0.20 [2025-11-06 19:11:52,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.87 | bwd_microstep: 94.09 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 92.93 | step_microstep: 2.12 [2025-11-06 19:11:52,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 396.96 | bwd: 99.45 | bwd_inner: 6.26 | bwd_allreduce: 92.99 | step: 2.23 100%|█████████▉| 3502/3507 [1:27:06<00:06, 1.39s/it] {'loss': 0.7776, 'learning_rate': 1.066585616071869e-10, 'epoch': 1.0} 100%|█████████▉| 3502/3507 [1:27:06<00:06, 1.39s/it]tensor([[-5.1562, -4.7188, 0.0079, 3.6875, -2.2656]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.2500, -3.9375, 0.3730, 2.0781, -3.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') [2025-11-06 19:11:52,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.12 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.72 | bwd_allreduce_microstep: 0.05 | step_microstep: 0.10 tensor([[-2.9688, 1.5938, 4.5312, -1.4531, -3.9375]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-5.1250, -4.4688, -0.0928, 2.7500, 
-2.5938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.6562, -0.2158, 2.5156, -0.8906, -3.6250]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-6.9688, -6.8750, -2.2969, 2.0312, -3.5625]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.6875, -5.2188, 1.3438, 1.8750, -5.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.4688, -4.6250, -1.8438, 2.0156, -1.7891]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:11:56,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.15 | optimizer_gradients: 0.22 | optimizer_step: 0.21 [2025-11-06 19:11:56,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.32 | bwd_microstep: 3382.26 | bwd_inner_microstep: 0.74 | bwd_allreduce_microstep: 3381.43 | step_microstep: 2.03 [2025-11-06 19:11:56,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 333.47 | bwd: 3383.13 | bwd_inner: 1.47 | bwd_allreduce: 3381.49 | step: 2.13 100%|█████████▉| 3503/3507 [1:27:10<00:08, 2.10s/it] {'loss': 0.1178, 'learning_rate': 6.826152311290024e-11, 'epoch': 1.0} 100%|█████████▉| 3503/3507 [1:27:10<00:08, 2.10s/it]tensor([[-4.8750, -1.4688, 1.9297, -1.1875, -4.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:56,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.49 | bwd_microstep: 0.91 | bwd_inner_microstep: 0.80 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-5.0312, -5.0938, -1.5391, 2.3594, -2.2188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-5.5938, -4.3125, -0.5508, 1.0703, -3.4219]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-6.0625, -5.2812, -1.3125, 1.3203, -3.4219]], 
device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-5.3438, -4.0312, 0.5352, 2.3438, -3.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:0') tensor([[-3.2969, -0.2490, 2.1250, -0.3242, -3.0781]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.3750, -5.1875, -0.9219, 3.1406, -2.3438]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-4.7500, -3.7031, 0.2949, 2.6094, -2.5000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') [2025-11-06 19:11:56,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.14 | optimizer_gradients: 0.13 | optimizer_step: 0.14 [2025-11-06 19:11:56,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.17 | bwd_microstep: 54.50 | bwd_inner_microstep: 1.04 | bwd_allreduce_microstep: 53.38 | step_microstep: 1.25 [2025-11-06 19:11:56,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 322.69 | bwd: 55.41 | bwd_inner: 1.87 | bwd_allreduce: 53.42 | step: 1.33 100%|█████████▉| 3504/3507 [1:27:10<00:04, 1.60s/it] {'loss': 0.1046, 'learning_rate': 3.8397125862177363e-11, 'epoch': 1.0} 100%|█████████▉| 3504/3507 [1:27:10<00:04, 1.60s/it]tensor([[-5.0625, -2.3906, 1.2734, -0.2734, -4.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:56,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 155.16 | bwd_microstep: 0.87 | bwd_inner_microstep: 0.76 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.08 tensor([[-7.7188, -6.0938, -1.0391, 0.6406, -5.0000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-6.8438, -3.0625, 2.8281, -0.0317, -5.7812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0938, -1.3828, 2.9531, -2.4062, -6.0938]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') tensor([[-4.1562, -1.5156, 2.3750, 0.9492, -3.3125]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') tensor([[-5.4062, -1.1875, 3.2031, -1.0938, -5.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-3.7188, -4.2812, -1.5000, 3.1875, -0.8867]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-4.9688, -5.0938, -1.4766, 2.6250, -2.0469]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') [2025-11-06 19:11:58,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.31 | optimizer_gradients: 0.19 | optimizer_step: 0.19 [2025-11-06 19:11:58,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.43 | bwd_microstep: 1216.54 | bwd_inner_microstep: 1.27 | bwd_allreduce_microstep: 1215.17 | step_microstep: 179.72 [2025-11-06 19:11:58,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 414.62 | bwd: 1217.41 | bwd_inner: 2.05 | bwd_allreduce: 1215.22 | step: 179.81 100%|█████████▉| 3505/3507 [1:27:12<00:03, 1.67s/it] {'loss': 0.6657, 'learning_rate': 1.7065395339077583e-11, 'epoch': 1.0} 100%|█████████▉| 3505/3507 [1:27:12<00:03, 1.67s/it]tensor([[-3.0938, -3.7812, -2.6562, 1.0391, -0.6836]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-3.2188, 0.2070, 1.6250, -2.0469, -3.4688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') [2025-11-06 19:11:58,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 152.36 | bwd_microstep: 0.74 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.06 tensor([[-5.1250, -0.8438, 3.6094, -1.4141, -5.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:2') tensor([[-5.8438, -5.2812, -0.6406, 2.9531, -2.8438]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1') tensor([[-2.8125, -3.3594, -1.3047, 2.6875, -0.4141]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:3') tensor([[-7.6562, -4.6875, 1.3750, 0.4316, -5.8125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0') tensor([[-1.8672, 0.0024, 1.1953, 0.4570, -1.3828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2') tensor([[-6.0938, -3.0156, 0.1797, -2.3750, -5.3125]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1') [2025-11-06 19:11:58,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.13 | optimizer_gradients: 0.16 | optimizer_step: 0.23 [2025-11-06 19:11:58,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.28 | bwd_microstep: 93.73 | bwd_inner_microstep: 0.79 | bwd_allreduce_microstep: 92.86 | step_microstep: 1.62 [2025-11-06 19:11:58,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 318.67 | bwd: 94.47 | bwd_inner: 1.46 | bwd_allreduce: 92.90 | step: 1.68 100%|█████████▉| 3506/3507 [1:27:12<00:01, 1.30s/it] {'loss': 0.3002, 'learning_rate': 4.2663497445971646e-12, 'epoch': 1.0} 100%|█████████▉| 3506/3507 [1:27:12<00:01, 1.30s/it]tensor([[-0.9648, 2.3438, 2.6406, -1.9375, -2.0469]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3') Failed to load video: /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch3/ExtremeSportsPovStockFootage-Adventurevideory.com.mp4, the dataset is: sharegpt4v_instruct_gpt4-vision_cap100k Failed to load video: /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch19/coftn-FRANKLIN_INSIDER_-_MyPilgrvideoPal.mp4, the dataset is: sharegpt4v_instruct_gpt4-vision_cap100k Warning: The cache directory for DeepSpeed Triton autotune, /root/.triton/autotune, appears to be on an NFS system. 
While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
tensor([[-4.9375, -0.3535, 4.0312, -1.5234, -5.2500]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:1')
tensor([[-4.4688, -2.0938, 1.9766, 0.9180, -3.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:11:59,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.47 | bwd_microstep: 0.84 | bwd_inner_microstep: 0.73 | bwd_allreduce_microstep: 0.03 | step_microstep: 0.07
tensor([[-5.3438, -4.9375, -0.8945, 2.1562, -2.7031]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
Failed to load video: /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch12/Big_Harp_video_Live_at_KDHX_9_5_15.mp4, the dataset is: sharegpt4v_instruct_gpt4-vision_cap100k
Failed to load video: /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch20/Pilgrvideo_to_Beethoven.mp4, the dataset is: sharegpt4v_instruct_gpt4-vision_cap100k
Failed to load video: /mnt/shared-storage-user/jiaziheng/tos/wenfarong/caolinhan/data/LSVQ/ia-batch21/Village_of_Romeoville_Ribbon_Cutting_-_videos_Hair_Care_April_14_2014.mp4, the dataset is: sharegpt4v_instruct_gpt4-vision_cap100k
tensor([[-2.1719, 2.1562, 2.8594, -3.1875, -3.5000]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:3')
tensor([[-5.0312, -4.5625, -0.2100, 3.1250, -2.3125]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:2')
tensor([[-5.2188, -3.8750, 0.3848, 2.2344, -3.0156]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) tensor([3], device='cuda:1')
tensor([[-2.5156, 1.8047, 2.8906, -2.8438, -3.6094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) tensor([2], device='cuda:0')
[2025-11-06 19:12:01,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.23 | optimizer_gradients: 0.15 | optimizer_step: 0.14
[2025-11-06 19:12:01,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 219.20 | bwd_microstep: 1.36 | bwd_inner_microstep: 0.65 | bwd_allreduce_microstep: 0.63 | step_microstep: 2.69
[2025-11-06 19:12:01,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 423.68 | bwd: 2.20 | bwd_inner: 1.39 | bwd_allreduce: 0.66 | step: 2.76
100%|██████████| 3507/3507 [1:27:14<00:00, 1.59s/it] {'loss': 0.2335, 'learning_rate': 0.0, 'epoch': 1.0}
100%|██████████| 3507/3507 [1:27:14<00:00, 1.59s/it][INFO|trainer.py:1962] 2025-11-06 19:12:01,180 >> Training completed.
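The Triton autotune warning repeated throughout this log suggests its own fix: point TRITON_CACHE_DIR at a non-NFS path before launching. A hedged sketch (the /tmp path below is illustrative, not from this run; any node-local directory works):

```shell
# Use node-local storage for DeepSpeed's Triton autotune cache instead of NFS,
# to avoid the slowdowns/hangs on exit that the warning describes.
export TRITON_CACHE_DIR=/tmp/triton_autotune_cache   # illustrative local path
mkdir -p "$TRITON_CACHE_DIR"
```

Set this in the job's launch script (before torchrun) so every rank inherits it.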
Do not forget to share your model on huggingface.co/models =) {'train_runtime': 5235.0373, 'train_samples_per_second': 5.359, 'train_steps_per_second': 0.67, 'train_loss': 0.5562806533278309, 'epoch': 1.0} 100%|██████████| 3507/3507 [1:27:15<00:00, 1.59s/it] 100%|██████████| 3507/3507 [1:27:15<00:00, 1.49s/it] [INFO|trainer.py:2936] 2025-11-06 19:12:01,644 >> Saving model checkpoint to /mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe [INFO|configuration_utils.py:473] 2025-11-06 19:12:01,647 >> Configuration saved in /mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe/config.json [INFO|configuration_utils.py:594] 2025-11-06 19:12:01,648 >> Configuration saved in /mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe/generation_config.json [INFO|modeling_utils.py:2493] 2025-11-06 19:12:02,224 >> Model weights saved in /mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe/model.safetensors [INFO|tokenization_utils_base.py:2433] 2025-11-06 19:12:02,227 >> tokenizer config file saved in /mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe/tokenizer_config.json [INFO|tokenization_utils_base.py:2442] 2025-11-06 19:12:02,227 >> Special tokens file saved in /mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe/special_tokens_map.json [INFO|tokenization_utils_base.py:2493] 2025-11-06 19:12:02,227 >> added tokens file saved in /mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe/added_tokens.json ***** train metrics ***** epoch = 1.0 train_loss = 0.5563 train_runtime = 1:27:15.03 train_samples = 28056 train_samples_per_second = 5.359 train_steps_per_second = 0.67 [rank0]:[W1106 19:12:02.564791062 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. 
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1106 19:59:11.816000 2530955 site-packages/torch/distributed/run.py:792]
W1106 19:59:11.816000 2530955 site-packages/torch/distributed/run.py:792] *****************************************
W1106 19:59:11.816000 2530955 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1106 19:59:11.816000 2530955 site-packages/torch/distributed/run.py:792] *****************************************
[2025-11-06 19:59:13,837] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-11-06 19:59:16,979] [INFO] [comm.py:652:init_distributed] cdb=None
11/06/2025 19:59:17 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False
[2025-11-06 19:59:17,165] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-11-06 19:59:17,165] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[WARNING|logging.py:314] 2025-11-06 19:59:17,267 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
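The ProcessGroupNCCL warning above ("destroy_process_group() was not called before program exit, which can leak resources") can be addressed by tearing the process group down explicitly. A hedged sketch (`register_ddp_cleanup` is a hypothetical helper; it degrades to a no-op when torch is absent or no group was initialized):

```python
import atexit


def register_ddp_cleanup():
    """Arrange for torch.distributed.destroy_process_group() at exit.

    Returns True if a cleanup hook was registered, False otherwise
    (torch not installed, distributed unavailable, or no group init'd).
    """
    try:
        import torch.distributed as dist
    except ImportError:
        # torch not installed: nothing to clean up.
        return False
    if dist.is_available() and dist.is_initialized():
        atexit.register(dist.destroy_process_group)
        return True
    return False
```

Calling this once after init_process_group (or letting the training script call destroy_process_group() itself after trainer.train()) silences the warning.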
tensor([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
[second debug tensor omitted: five rows of raw float values with magnitudes from ~1.8e-40 up to 2.3527e+38, a pattern consistent with printing an uninitialized buffer]
11/06/2025 19:59:17 -
11/06/2025 19:59:17 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
11/06/2025 19:59:17 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=zero_stage1_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=2,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe/runs/Nov06_19-59-17_gpu-lg-cmc-h-h200-0964.host.h.pjlab.org.cn,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.COSINE,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1.0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
output_dir=/mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=/mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=5000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=2,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.05,
)
11/06/2025 19:59:17 - INFO - __main__ - Loading Tokenizer: /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified
[INFO|tokenization_utils_base.py:2025] 2025-11-06 19:59:17,356 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2025] 2025-11-06 19:59:17,356 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2025] 2025-11-06 19:59:17,356 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2025-11-06 19:59:17,356 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2025-11-06 19:59:17,356 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2025-11-06 19:59:17,356 >> loading file tokenizer.json
[2025-11-06 19:59:17,390] [INFO]
[comm.py:652:init_distributed] cdb=None
[2025-11-06 19:59:17,390] [INFO] [comm.py:652:init_distributed] cdb=None
[WARNING|logging.py:314] 2025-11-06 19:59:17,510 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
11/06/2025 19:59:17 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2025-11-06 19:59:17,514 >> loading configuration file /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified/config.json
[INFO|configuration_utils.py:792] 2025-11-06 19:59:17,515 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "_name_or_path": "/mnt/shared-storage-user/jiaziheng/LMMS/internvl-pretrain-10_9_clip",
  "architectures": ["InternVLChatModel"],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "hidden_size": 3584,
  "image_fold": null,
  "llm_config": {
    "_name_or_path": "./pretrained/Qwen2.5-32B-Instruct",
    "add_cross_attention": false,
    "architectures": ["Qwen2ForCausalLM"],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 151643,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 151643,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 3584,
    "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
    "initializer_range": 0.02,
    "intermediate_size": 18944,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {"LABEL_0": 0, "LABEL_1": 1},
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "max_window_layers": 70,
    "min_length": 0,
    "model_type": "qwen2",
    "moe_config": null,
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 28,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 28,
    "num_key_value_heads": 4,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-06,
    "rope_scaling": {"factor": 2.0, "rope_type": "dynamic", "type": "dynamic"},
    "rope_theta": 1000000.0,
    "sep_token_id": null,
    "sliding_window": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": false,
    "use_sliding_window": false,
    "vocab_size": 151674
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internvl2_5",
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "OpenGVLab/InternViT-6B-448px-V1-5",
    "add_cross_attention": false,
    "architectures": ["InternVisionModel"],
    "attention_dropout": 0.0,
    "auto_map": {
      "AutoConfig": "configuration_intern_vit.InternVisionConfig",
      "AutoModel": "modeling_intern_vit.InternVisionModel"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "capacity_factor": 1.2,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "eval_capacity_factor": 1.4,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
    "image_size": 448,
    "initializer_factor": 0.1,
    "initializer_range": 1e-10,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {"LABEL_0": 0, "LABEL_1": 1},
    "laux_allreduce": "all_nodes",
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "moe_coeff_ratio": 0.5,
    "moe_intermediate_size": 768,
    "moe_output_scale": 4.0,
    "no_repeat_ngram_size": 0,
    "noisy_gate_policy": "RSample_before",
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_experts": 8,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "num_routed_experts": 4,
    "num_shared_experts": 4,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "shared_expert_intermediate_size": 3072,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true,
    "use_moe": false,
    "use_residual": true,
    "use_rts": false,
    "use_weighted_residual": false
  }
}
11/06/2025 19:59:17 - INFO - __main__ -
Using flash_attention_2 for LLaMA
[INFO|modeling_utils.py:3473] 2025-11-06 19:59:17,516 >> loading weights file /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified/model.safetensors
[INFO|modeling_utils.py:1426] 2025-11-06 19:59:17,536 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2025-11-06 19:59:17,537 >> Generate config GenerationConfig {}
(stray debug tensor dumps elided — multi-row printouts of uninitialized-memory values)
11/06/2025 19:59:17 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
11/06/2025 19:59:17 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False
[WARNING|logging.py:314] 2025-11-06 19:59:18,114 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-11-06 19:59:18,115 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
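The TrainingArguments above pass deepspeed=zero_stage1_config.json, but the file's contents never appear in the log. As a rough sketch only: a minimal ZeRO stage-1 config consistent with the logged settings (bf16=True, per-device train batch 1, gradient_accumulation_steps=2) could look like the fragment below, using the "auto" placeholders the HF Trainer integration resolves from its own arguments — the project's actual file may differ.

```json
{
  "zero_optimization": {
    "stage": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
```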
tensor([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]]) tensor([[-2.3293e-21, 4.6322e-21, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00], [ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00], [ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00], ..., [ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00], [ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00], [ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00]]) tensor([[-3.2817e-28, -9.4995e-07, 2.9509e+38, 0.0000e+00, 5.1722e-37, 1.0171e-33, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00], [ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00], [ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00], [ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00], [ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]]) tensor([[ 5.1651e+20, 1.5399e-27, 2.4059e+38, 0.0000e+00, 5.1651e+20, 1.5399e-27, 2.4059e+38, 0.0000e+00, -3.0720e+04, 7.3339e-31, 0.0000e+00, 0.0000e+00, -3.0720e+04, 7.3339e-31, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00], [ 1.0000e+00, 1.0000e+00, 
1.0000e+00, ..., 1.0000e+00]])  (all-ones debug tensor; repeated values elided)
[INFO|modeling_utils.py:4350] 2025-11-06 19:59:18,364 >> All model checkpoint weights were used when initializing InternVLChatModel.
[INFO|modeling_utils.py:4358] 2025-11-06 19:59:18,364 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2025-11-06 19:59:18,368 >> loading configuration file /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified/generation_config.json
[INFO|configuration_utils.py:826] 2025-11-06 19:59:18,368 >> Generate config GenerationConfig {}
Total parameters: ~339.26 MB
Trainable parameters: ~1.61 MB
data_args.use_packed_ds False
11/06/2025 19:59:18 - INFO - __main__ - Finished
11/06/2025 19:59:18 - INFO - __main__ - model.config.force_image_size: 448
11/06/2025 19:59:18 - INFO - __main__ - data_args.force_image_size: 448
11/06/2025 19:59:18 - INFO - __main__ - model.config.vision_config.image_size: 448
11/06/2025 19:59:18 - INFO - __main__ - [Dataset] num_image_token: 256
11/06/2025 19:59:18 - INFO - __main__ - [Dataset] dynamic_image_size: True
11/06/2025 19:59:18 - INFO - __main__ - [Dataset] use_thumbnail: True
11/06/2025 19:59:18 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
11/06/2025 19:59:18 - INFO - __main__ - Formatting inputs...Skip in lazy mode
11/06/2025 19:59:18 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 28056
11/06/2025 19:59:18 - INFO - __main__ - quality.0.weight
11/06/2025 19:59:18 - INFO - __main__ - quality.0.bias
11/06/2025 19:59:18 - INFO - __main__ - quality.1.weight
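The trainable-parameter list above (only quality.0.* and quality.1.*) matches a linear-probe setup: the backbone is frozen and only a small quality head is trained. A quick sanity check on the two summary lines, assuming the "~339.26 / ~1.61" figures report parameter counts in millions (the trainer later logs 1,606,405 trainable parameters, which lines up with "~1.61"):

```python
# Hypothetical sanity check; the values below are copied from this log.
# Assumption: the "~1.61" summary corresponds to the 1,606,405 trainable
# parameters reported later by the trainer, i.e. units of millions.
total_millions = 339.26
trainable_params = 1_606_405

trainable_millions = trainable_params / 1e6   # ~1.61
fraction = trainable_millions / total_millions

print(f"trainable: {trainable_millions:.2f}M "
      f"({fraction:.2%} of all parameters)")
```

Under half a percent of the model is trainable, consistent with the "linear_probe" suffix of the output directory later in the log.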
11/06/2025 19:59:18 - INFO - __main__ - quality.1.bias
[INFO|trainer.py:571] 2025-11-06 19:59:18,654 >> Using auto half precision backend
[2025-11-06 19:59:18,829] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown
[2025-11-06 19:59:18,829] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
[2025-11-06 19:59:21,311] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.13599872589111328 seconds [2025-11-06 19:59:21,449] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2025-11-06 19:59:21,449] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2025-11-06 19:59:21,450] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2025-11-06 19:59:21,450] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2025-11-06 19:59:21,450] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer [2025-11-06 19:59:21,450] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000 [2025-11-06 19:59:21,450] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000 [2025-11-06 19:59:21,450] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False [2025-11-06 19:59:21,450] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 0.202911376953125 seconds Loading extension module fused_adam... 
Time to load fused_adam op: 0.20151448249816895 seconds Time to load fused_adam op: 0.20120024681091309 seconds [2025-11-06 19:59:21,588] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2025-11-06 19:59:21,588] [INFO] [utils.py:782:see_memory_usage] MA 0.64 GB Max_MA 0.64 GB CA 0.66 GB Max_CA 1 GB [2025-11-06 19:59:21,590] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 661.47 GB, percent = 48.4% [2025-11-06 19:59:21,680] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2025-11-06 19:59:21,680] [INFO] [utils.py:782:see_memory_usage] MA 0.64 GB Max_MA 0.64 GB CA 0.66 GB Max_CA 1 GB [2025-11-06 19:59:21,682] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 661.47 GB, percent = 48.4% [2025-11-06 19:59:21,682] [INFO] [stage_1_and_2.py:544:__init__] optimizer state initialized [2025-11-06 19:59:21,766] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2025-11-06 19:59:21,766] [INFO] [utils.py:782:see_memory_usage] MA 0.64 GB Max_MA 0.64 GB CA 0.66 GB Max_CA 1 GB [2025-11-06 19:59:21,767] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 661.47 GB, percent = 48.4% [2025-11-06 19:59:21,768] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2025-11-06 19:59:21,768] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler [2025-11-06 19:59:21,768] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2025-11-06 19:59:21,768] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]] [2025-11-06 19:59:21,769] [INFO] [config.py:999:print] DeepSpeedEngine configuration: [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": 
false, "profile": false } [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] amp_enabled .................. False [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] amp_params ................... False [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] bfloat16_enabled ............. True [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] comms_config ................. [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-11-06 19:59:21,769] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] dump_state ................... 
False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] graph_harvesting ............. 
False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] optimizer_name ............... adamw [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 2e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.05} [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] pld_params ................... False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] train_batch_size ............. 8 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] world_size ................... 4 [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] zero_config .................. 
stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-11-06 19:59:21,770] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
1 [2025-11-06 19:59:21,770] [INFO] [config.py:989:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 2e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.05 } }, "gradient_accumulation_steps": 2, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 8, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": true } [INFO|trainer.py:1721] 2025-11-06 19:59:21,770 >> ***** Running training ***** [INFO|trainer.py:1722] 2025-11-06 19:59:21,770 >> Num examples = 28,056 [INFO|trainer.py:1723] 2025-11-06 19:59:21,770 >> Num Epochs = 1 [INFO|trainer.py:1724] 2025-11-06 19:59:21,770 >> Instantaneous batch size per device = 1 [INFO|trainer.py:1727] 2025-11-06 19:59:21,771 >> Total train batch size (w. 
parallel, distributed & accumulation) = 8
[INFO|trainer.py:1728] 2025-11-06 19:59:21,771 >> Gradient Accumulation steps = 2
[INFO|trainer.py:1729] 2025-11-06 19:59:21,771 >> Total optimization steps = 3,507
[INFO|trainer.py:1730] 2025-11-06 19:59:21,771 >> Number of trainable parameters = 1,606,405
0%| | 0/3507 [00:00<?, ?it/s]
ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=/mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe/runs/Nov06_20-01-04_gpu-lg-cmc-h-h200-0964.host.h.pjlab.org.cn, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=IntervalStrategy.STEPS, lr_scheduler_kwargs={}, lr_scheduler_type=SchedulerType.COSINE, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=1.0, optim=OptimizerNames.ADAMW_TORCH, optim_args=None, output_dir=/mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=/mnt/shared-storage-user/jiaziheng/LMMS/internvit-lsvq-11_6_FS_linear_probe, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=5000, save_strategy=IntervalStrategy.STEPS, save_total_limit=2, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None,
torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.05, ) 11/06/2025 20:01:04 - INFO - __main__ - Loading Tokenizer: /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified [INFO|tokenization_utils_base.py:2025] 2025-11-06 20:01:04,643 >> loading file vocab.json [INFO|tokenization_utils_base.py:2025] 2025-11-06 20:01:04,643 >> loading file merges.txt [INFO|tokenization_utils_base.py:2025] 2025-11-06 20:01:04,643 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2025] 2025-11-06 20:01:04,643 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2025] 2025-11-06 20:01:04,643 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2025] 2025-11-06 20:01:04,643 >> loading file tokenizer.json [2025-11-06 20:01:04,687] [INFO] [comm.py:652:init_distributed] cdb=None [2025-11-06 20:01:04,687] [INFO] [comm.py:652:init_distributed] cdb=None [2025-11-06 20:01:04,687] [INFO] [comm.py:652:init_distributed] cdb=None [WARNING|logging.py:314] 2025-11-06 20:01:04,791 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 11/06/2025 20:01:04 - INFO - __main__ - Loading InternVLChatModel... 
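The trainer banner earlier in the log (total train batch size 8, 3,507 optimization steps) follows directly from the parallelism settings printed by DeepSpeed; a minimal check, with every value copied from the log:

```python
import math

# Values copied from the log above.
per_device_batch = 1   # train_micro_batch_size_per_gpu
grad_accum = 2         # gradient_accumulation_steps
world_size = 4         # DeepSpeed world_size
num_examples = 28_056  # sharegpt4v_instruct_gpt4-vision_cap100k length
num_epochs = 1

# Effective batch = per-device batch x accumulation x data-parallel ranks.
total_batch = per_device_batch * grad_accum * world_size
steps = math.ceil(num_examples * num_epochs / total_batch)

print(total_batch, steps)  # 8 3507
```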
[INFO|configuration_utils.py:727] 2025-11-06 20:01:04,795 >> loading configuration file /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified/config.json [INFO|configuration_utils.py:792] 2025-11-06 20:01:04,795 >> Model config InternVLChatConfig { "_commit_hash": null, "_name_or_path": "/mnt/shared-storage-user/jiaziheng/LMMS/internvl-pretrain-10_9_clip", "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "hidden_size": 3584, "image_fold": null, "llm_config": { "_name_or_path": "./pretrained/Qwen2.5-32B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151643, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 3584, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 18944, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 70, "min_length": 0, "model_type": "qwen2", "moe_config": null, "no_repeat_ngram_size": 0, "num_attention_heads": 28, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 28, "num_key_value_heads": 4, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": 
false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": { "factor": 2.0, "rope_type": "dynamic", "type": "dynamic" }, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": false, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internvl2_5", "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_name_or_path": "OpenGVLab/InternViT-6B-448px-V1-5", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "auto_map": { "AutoConfig": "configuration_intern_vit.InternVisionConfig", "AutoModel": "modeling_intern_vit.InternVisionModel" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "capacity_factor": 1.2, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "eval_capacity_factor": 1.4, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", 
"1": "LABEL_1" }, "image_size": 448, "initializer_factor": 0.1, "initializer_range": 1e-10, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "laux_allreduce": "all_nodes", "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "moe_coeff_ratio": 0.5, "moe_intermediate_size": 768, "moe_output_scale": 4.0, "no_repeat_ngram_size": 0, "noisy_gate_policy": "RSample_before", "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_experts": 8, "num_hidden_layers": 24, "num_return_sequences": 1, "num_routed_experts": 4, "num_shared_experts": 4, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "shared_expert_intermediate_size": 3072, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true, "use_moe": false, "use_residual": true, "use_rts": false, "use_weighted_residual": false } } 11/06/2025 20:01:04 - INFO - __main__ - Using flash_attention_2 for LLaMA [INFO|modeling_utils.py:3473] 2025-11-06 20:01:04,797 >> loading weights file /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified/model.safetensors [INFO|modeling_utils.py:1426] 2025-11-06 20:01:04,816 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. 
[INFO|configuration_utils.py:826] 2025-11-06 20:01:04,817 >> Generate config GenerationConfig {}
[... debug tensor dumps elided ...]
11/06/2025 20:01:04 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False
11/06/2025 20:01:04 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
11/06/2025 20:01:04 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False
[WARNING|logging.py:314] 2025-11-06 20:01:05,031 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-11-06 20:01:05,036 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-11-06 20:01:05,036 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[... debug tensor dumps elided ...]
[INFO|modeling_utils.py:4350] 2025-11-06 20:01:05,716 >> All model checkpoint weights were used when initializing InternVLChatModel.
[INFO|modeling_utils.py:4358] 2025-11-06 20:01:05,716 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2025-11-06 20:01:05,721 >> loading configuration file /mnt/shared-storage-user/jiaziheng/LMMS/qualclip-internvit-SF-400M_11_6_modified/generation_config.json
[INFO|configuration_utils.py:826] 2025-11-06 20:01:05,721 >> Generate config GenerationConfig {}
11/06/2025 20:01:05 - INFO - __main__ - Finished
11/06/2025 20:01:05 - INFO - __main__ - model.config.force_image_size: 448
11/06/2025 20:01:05 - INFO - __main__ - data_args.force_image_size: 448
11/06/2025 20:01:05 - INFO - __main__ - model.config.vision_config.image_size: 448
11/06/2025 20:01:05 - INFO - __main__ - [Dataset] num_image_token: 256
11/06/2025 20:01:05 - INFO - __main__ - [Dataset] dynamic_image_size: True
11/06/2025 20:01:05 - INFO - __main__ - [Dataset] use_thumbnail: True
11/06/2025 20:01:05 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
11/06/2025 20:01:05 - INFO - __main__ - Formatting inputs...Skip in lazy mode
11/06/2025 20:01:05 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 28056
11/06/2025 20:01:05 - INFO - __main__ - vision_model.embeddings.class_embedding
11/06/2025 20:01:05 - INFO - __main__ - vision_model.embeddings.position_embedding
11/06/2025 20:01:05 - INFO - __main__ -
vision_model.embeddings.patch_embedding.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.embeddings.patch_embedding.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.ls1 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.ls2 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.attn.qkv.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.attn.qkv.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.attn.proj.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.attn.proj.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.mlp.fc1.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.mlp.fc1.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.mlp.fc2.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.mlp.fc2.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.norm1.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.norm1.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.norm2.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.0.norm2.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.ls1 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.ls2 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.attn.qkv.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.attn.qkv.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.attn.proj.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.attn.proj.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.mlp.fc1.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.mlp.fc1.bias 11/06/2025 20:01:05 - INFO - __main__ - 
vision_model.encoder.layers.1.mlp.fc2.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.mlp.fc2.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.norm1.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.norm1.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.norm2.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.1.norm2.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.ls1 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.ls2 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.attn.qkv.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.attn.qkv.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.attn.proj.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.attn.proj.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.mlp.fc1.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.mlp.fc1.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.mlp.fc2.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.mlp.fc2.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.norm1.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.norm1.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.norm2.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.2.norm2.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.ls1 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.ls2 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.attn.qkv.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.attn.qkv.bias 11/06/2025 20:01:05 - INFO - __main__ - 
vision_model.encoder.layers.3.attn.proj.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.attn.proj.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.mlp.fc1.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.mlp.fc1.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.mlp.fc2.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.mlp.fc2.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.norm1.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.norm1.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.norm2.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.3.norm2.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.ls1 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.ls2 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.attn.qkv.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.attn.qkv.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.attn.proj.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.attn.proj.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.mlp.fc1.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.mlp.fc1.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.mlp.fc2.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.mlp.fc2.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.norm1.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.norm1.bias 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.norm2.weight 11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.4.norm2.bias 11/06/2025 20:01:05 - INFO - __main__ - 
11/06/2025 20:01:05 - INFO - __main__ - vision_model.encoder.layers.5.ls1
    ... (the same parameter set is logged for each of vision_model.encoder.layers.5 through layers.23:
    ls1, ls2, attn.qkv.{weight,bias}, attn.proj.{weight,bias},
    mlp.fc1.{weight,bias}, mlp.fc2.{weight,bias},
    norm1.{weight,bias}, norm2.{weight,bias})
11/06/2025 20:01:05 - INFO - __main__ - quality.0.weight
11/06/2025 20:01:05 - INFO - __main__ - quality.0.bias
11/06/2025 20:01:05 - INFO - __main__ - quality.1.weight
11/06/2025 20:01:05 - INFO - __main__ - quality.1.bias
11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.0.multipathway_blocks.0.conv.weight
    ... (feature_extraction.0 logs conv.weight and norm.{weight,bias} for multipathway_blocks.0 and .1,
    then multipathway_fusion.conv_fast_to_slow.weight and multipathway_fusion.norm.{weight,bias})
11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch1_conv.weight
    ... (feature_extraction stages 1 and 2 log, for each pathway multipathway_blocks.{0,1}:
    res_blocks.0 with branch1_conv.weight and branch1_norm.{weight,bias},
    and every res_block with branch2.conv_{a,b,c}.weight and branch2.norm_{a,b,c}.{weight,bias};
    stage 1 has res_blocks.0–2 per pathway, stage 2 has res_blocks.0–3;
    stage 1 ends with multipathway_fusion.conv_fast_to_slow.weight and multipathway_fusion.norm.{weight,bias})
11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.2.multipathway_fusion.conv_fast_to_slow.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.2.multipathway_fusion.norm.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.2.multipathway_fusion.norm.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch1_conv.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch1_norm.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch1_norm.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch1_conv.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch1_norm.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch1_norm.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_fusion.conv_fast_to_slow.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.3.multipathway_fusion.norm.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.3.multipathway_fusion.norm.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch1_conv.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch1_norm.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch1_norm.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch1_conv.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch1_norm.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch1_norm.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_c.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.conv_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_a.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_a.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.conv_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_b.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_b.bias 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.conv_c.weight 11/06/2025 20:01:05 - INFO - __main__ - 
slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_c.weight 11/06/2025 20:01:05 - INFO - __main__ - slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_c.bias Total parameters: ~339.26 MB) Trainable parameters: ~339.26 MB) data_args.use_packed_ds False [INFO|trainer.py:571] 2025-11-06 20:01:05,985 >> Using auto half precision backend [2025-11-06 20:01:06,093] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown [2025-11-06 20:01:06,093] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4 Total parameters: ~339.26 MB) Trainable parameters: ~339.26 MB) data_args.use_packed_ds False Total parameters: ~339.26 MB) Trainable parameters: ~339.26 MB) data_args.use_packed_ds False Total parameters: ~339.26 MB) Trainable parameters: ~339.26 MB) data_args.use_packed_ds False [2025-11-06 20:01:08,132] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... ninja: no work to do. Loading extension module fused_adam... 
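The "Total parameters / Trainable parameters" lines above report a size in MB rather than a raw count. The training script's actual helper is not shown; the following is a minimal sketch of how such a summary could be computed, assuming 2 bytes per parameter (bf16, matching the ZeRO bf16 optimizer logged below). The function name and the triple-based input format are hypothetical.

```python
# Hypothetical helper mirroring the "Total/Trainable parameters: ~X MB" log lines.
# Input: (name, numel, requires_grad) triples, e.g. from
# [(n, p.numel(), p.requires_grad) for n, p in model.named_parameters()].
def summarize_params(named_params, bytes_per_param=2):  # assumption: bf16 = 2 bytes
    total = sum(numel for _, numel, _ in named_params)
    trainable = sum(numel for _, numel, rg in named_params if rg)
    to_mb = lambda count: count * bytes_per_param / 2**20
    return to_mb(total), to_mb(trainable)

# Toy example: two 1Mi-element tensors, one frozen.
params = [("conv.weight", 1_048_576, True), ("norm.bias", 1_048_576, False)]
total_mb, trainable_mb = summarize_params(params)
print(f"Total parameters: ~{total_mb:.2f} MB")       # -> ~4.00 MB
print(f"Trainable parameters: ~{trainable_mb:.2f} MB")  # -> ~2.00 MB
```

Note that equal "Total" and "Trainable" values, as in this log, indicate that every listed parameter has requires_grad=True.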
Time to load fused_adam op: 0.136946439743042 seconds
[2025-11-06 20:01:08,270] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-11-06 20:01:08,270] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-11-06 20:01:08,299] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2025-11-06 20:01:08,300] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=
[2025-11-06 20:01:08,300] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
[2025-11-06 20:01:08,300] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000
[2025-11-06 20:01:08,300] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000
[2025-11-06 20:01:08,300] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2025-11-06 20:01:08,300] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
Loading extension module fused_adam...
Time to load fused_adam op: 0.2017519474029541 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.20153093338012695 seconds
Time to load fused_adam op: 0.20163917541503906 seconds
[2025-11-06 20:01:09,096] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-11-06 20:01:09,096] [INFO] [utils.py:782:see_memory_usage] MA 0.95 GB Max_MA 1.11 GB CA 1.12 GB Max_CA 1 GB
[2025-11-06 20:01:09,098] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 661.88 GB, percent = 48.4%
[2025-11-06 20:01:09,568] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-11-06 20:01:09,569] [INFO] [utils.py:782:see_memory_usage] MA 0.95 GB Max_MA 1.27 GB CA 1.43 GB Max_CA 1 GB
[2025-11-06 20:01:09,574] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 660.66 GB, percent = 48.3%
[2025-11-06 20:01:09,574] [INFO] [stage_1_and_2.py:544:__init__] optimizer state initialized
[2025-11-06 20:01:09,708] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-11-06 20:01:09,708] [INFO] [utils.py:782:see_memory_usage] MA 0.95 GB Max_MA 0.95 GB CA 1.43 GB Max_CA 1 GB
[2025-11-06 20:01:09,709] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 661.7 GB, percent = 48.4%
[2025-11-06 20:01:09,712] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2025-11-06 20:01:09,712] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2025-11-06 20:01:09,712] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2025-11-06 20:01:09,712] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2025-11-06 20:01:09,714] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2025-11-06 20:01:09,714] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2025-11-06 20:01:09,714] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-11-06 20:01:09,714] [INFO] [config.py:1003:print] amp_enabled .................. False
[2025-11-06 20:01:09,714] [INFO] [config.py:1003:print] amp_params ................... False
[2025-11-06 20:01:09,714] [INFO] [config.py:1003:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2025-11-06 20:01:09,714] [INFO] [config.py:1003:print] bfloat16_enabled ............. True
[2025-11-06 20:01:09,714] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False
[2025-11-06 20:01:09,714] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False
[2025-11-06 20:01:09,714] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True
[2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False
[2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] comms_config .................
[2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] communication_data_type ...... None
[2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False
[2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False
[2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False
[2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] dataloader_drop_last ......... False
[2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] disable_allgather ............ False
[2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] dump_state ...................
False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] graph_harvesting ............. 
False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] optimizer_name ............... adamw [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 2e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.05} [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] pld_params ................... False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-11-06 20:01:09,715] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] train_batch_size ............. 8 [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1 [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] world_size ................... 4 [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] zero_config .................. 
stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-11-06 20:01:09,716] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
1 [2025-11-06 20:01:09,716] [INFO] [config.py:989:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 2e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.05 } }, "gradient_accumulation_steps": 2, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 8, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": true } [INFO|trainer.py:1721] 2025-11-06 20:01:09,716 >> ***** Running training ***** [INFO|trainer.py:1722] 2025-11-06 20:01:09,716 >> Num examples = 28,056 [INFO|trainer.py:1723] 2025-11-06 20:01:09,716 >> Num Epochs = 1 [INFO|trainer.py:1724] 2025-11-06 20:01:09,716 >> Instantaneous batch size per device = 1 [INFO|trainer.py:1727] 2025-11-06 20:01:09,716 >> Total train batch size (w. parallel, distributed & accumulation) = 8 [INFO|trainer.py:1728] 2025-11-06 20:01:09,716 >> Gradient Accumulation steps = 2 [INFO|trainer.py:1729] 2025-11-06 20:01:09,716 >> Total optimization steps = 3,507 [INFO|trainer.py:1730] 2025-11-06 20:01:09,718 >> Number of trainable parameters = 339,263,181 0%| | 0/3507 [00:00
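As a quick sanity check (not part of the log), the trainer's reported numbers are mutually consistent: the effective batch size is micro-batch per GPU × gradient-accumulation steps × world size, and the step count is one epoch's examples divided by that batch size.

```python
# Verify the batch-size and step arithmetic reported by the trainer above.
micro_batch_per_gpu = 1   # "train_micro_batch_size_per_gpu"
grad_accum_steps = 2      # "gradient_accumulation_steps"
world_size = 4            # "world_size"

train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
print(train_batch_size)   # 8, matching "Total train batch size"

num_examples = 28_056     # "Num examples", trained for 1 epoch
steps = num_examples // train_batch_size
print(steps)              # 3507, matching "Total optimization steps"
```

28,056 happens to divide evenly by 8, so there is no partial final batch regardless of the `dataloader_drop_last=False` setting shown in the config dump.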